CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
491 stars 190 forks source link

bug for handling /1 suffix in single-ended reads #580

Closed anoronh4 closed 1 year ago

anoronh4 commented 1 year ago

I have noticed inconsistent behavior in how umi_tools extract handles the /1 suffix depending on whether the inputs are paired end or single end. when the input is paired-end read id is changed from @SRR5665260.1.36873006/1 to @SRR5665260.1.36873006_CAG, basically removing the read number as it separates them into two distinct files. If I take only the read1 fastq file and run umi_tools extract, the same read becomes @SRR5665260.1.36873006/1_CAG, even though i am using --ignore-read-pair-suffixes in both cases. I am using v1.1.2.

This is causing issues downstream for me when i run alignment with STAR. STAR discards everything after / in the read ID. just wondering if this can be fixed on the umi_tools end.

TomSmithCGAT commented 1 year ago

Hmmm.. That's strange and I can't reproduce with umi_tools 1.1.3. From the tests directory, if I run the following for paired end and then single end extraction.

umi_tools extract  --extract-method=string --read2-in=scrb_seq_fastq.2.gz --bc-pattern=NNNNNNNNNN -L test.log  --stdin=scrb_seq_fastq.1.gz --read2-out=delete_me.2.fastq > delete_me.1.fastq

umi_tools extract  --extract-method=string --bc-pattern=NNNNNNNNNN -L test.log  --stdin=scrb_seq_fastq.1.gz > delete_me.fastq

The headers have the expected format in both cases:

==> delete_me.1.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17

==> delete_me.2.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34

==> delete_me.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
TomSmithCGAT commented 1 year ago

Could you please try updating to 1.1.3 to see if this resolves the issue. There was a small update to allow a user-defined umi separator in the output (https://github.com/CGATOxford/UMI-tools/pull/548), so it's possible the behaviour has changed, but our interated regression testing didn't pick that up if so 🤷

anoronh4 commented 1 year ago

Sorry for not seeing your reply earlier. I continue to see the issue with both 1.1.3 and 1.1.4. my command is

umi_tools extract \
    -I test_rnaseq_1.fastq.gz \
    -S SAMPLE_SINGLE_END_UMI@SRR5665260.1.36873006@@SAMPLE_SINGLE_END_UMI_T1.umi_extract.fastq.gz \
    --bc-pattern="NNN"  --ignore-read-pair-suffixes

You can grab this same input to test here: https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz . i think in the example you used the header might have been formatted differently than in mine -- there is no flowcell information in the header in my case.

TomSmithCGAT commented 1 year ago

Thanks for the example 😁 I get the same output.

While I look into it, you can try using the --readNameSeparator option with STAR. The manual states:


default: /
string(s): character(s) separating the part of the read names that will be
trimmed in output (read name after space is always trimmed)```

So setting that to a space if possible, or some other character that shouldn't exist (e.g ':') should work as an immediate solution.
TomSmithCGAT commented 1 year ago

umi_tools needs to ensure that the read names for paired reads are the same after adding the UMI or else downstream tools will throw errors. The information after the first space in field 1 of a fastq entry is not considered part of the read name as is optional (fastq format). Hence, umi_tools adds the UMIs to the end of the first part of field 1.

I think I'm right in saying there is no strict convention for denoting read 1/2 in field 1. Usually it's e.g /1 at the very end of the field with at least one space in between the read name and the read number. However, sometimes, field 1 can contain the read number at the end of the read name. This is why star removes everything after the '/' and why umi_tools allows the user to ignore the read suffix when checking that the two reads have the same name (--ignore-read-pair-suffixes; https://github.com/CGATOxford/UMI-tools/issues/325). This option will also remove the read pair suffixes (/1 / /2) from the read name to ensure they are identical.

In your case, you have a paired end read file with the read numbers as part of the read name. As you show above, this is handled appropriately with --ignore-read-pair-suffixes when processing both reads, but not when processing just one read. The obvious question is why do you want to use just one read? Given that you do though, it would be possible to make umi_tools remove the read pair suffix in single end mode as well.

@IanSudbery, would just need to add the equivalent of https://github.com/CGATOxford/UMI-tools/blob/d98ebace3a096bdd06dbe2d4feb0388ccf13d0b1/umi_tools/umi_methods.py#L119-L132 to the single end fastq parser. Wouldn't add any overhead when the option is not used and mimimal when used. What do you think?

IanSudbery commented 1 year ago

Not quite that simple. There isn't a single end fastq parser that is the equivalent of the joinedFastqParser. This is presumably where the problem is coming from. The structure here is that the joinedParser is called with two fastq iteratrors as parameters. In the single-end case there is no parser, just the raw iterator.

I don't think there should be any problem with moving the suffix code to the iterator. Except that I'll need to make sure it always gets called correctly.

TomSmithCGAT commented 1 year ago

Ah, yeah, I was scanning the code too lightly. Nice work on the PR 👍

anoronh4 commented 1 year ago

i installed the latest updates including this PR and unfortunately it errored out.

$ umi_tools extract     -I test_rnaseq_1.fastq.gz     -S extract.fastq.gz     --bc-pattern="NNN"  --ignore-read-pair-suffixes
# UMI-tools version: 1.1.4
# output generated by extract -I test_rnaseq_1.fastq.gz -S extract.fastq.gz --bc-pattern=NNN --ignore-read-pair-suffixes
# job started at Thu Apr 13 12:19:49 2023 on terra -- 786c956e-8418-4a4a-a3e9-493934d944b7
# pid: 15240, system: Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64
# blacklist                               : None
# compresslevel                           : 6
# correct_umi_threshold                   : 0
# either_read                             : False
# either_read_resolve                     : discard
# error_correct_cell                      : False
# extract_method                          : string
# filter_cell_barcode                     : None
# filter_cell_barcodes                    : False
# filter_umi                              : None
# filtered_out                            : None
# filtered_out2                           : None
# ignore_suffix                           : True
# log2stderr                              : False
# loglevel                                : 1
# pattern                                 : NNN
# pattern2                                : None
# prime3                                  : None
# quality_encoding                        : None
# quality_filter_mask                     : None
# quality_filter_threshold                : None
# random_seed                             : None
# read2_in                                : None
# read2_out                               : False
# read2_stdout                            : False
# reads_subset                            : None
# reconcile                               : False
# retain_umi                              : None
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin                                   : <_io.TextIOWrapper name='test_rnaseq_1.fastq.gz' encoding='ascii'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout                                  : <_io.TextIOWrapper name='extract.fastq.gz' encoding='ascii'>
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_correct_log                         : None
# umi_separator                           : _
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# whitelist                               : None
2023-04-13 12:19:49,025 INFO Starting barcode extraction
Traceback (most recent call last):
  File "/home/noronhaa/anaconda3/envs/umitools_test/bin/umi_tools", line 33, in <module>
    sys.exit(load_entry_point('umi-tools==1.1.4', 'console_scripts', 'umi_tools')())
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_tools.py", line 66, in main
    module.main(sys.argv)
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/extract.py", line 439, in main
    for read in read1s:
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 116, in fastqIterate
    line1 = removeReadIDSuffix(line1)
  File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 92, in removeReadIDSuffix
    'read suffix must be /1 or /2. Observed: %s' % read_id[-2:])
ValueError: read suffix must be /1 or /2. Observed: 1
anoronh4 commented 1 year ago

is there any update here? i've tested out the master branch by installing from a clone into my conda environment (python setup.py install) and then tried to compare the behaviors of the two. seems like there's a problem now with paired end fastqs as well. here's a breakdown of what i see:

version type error correct header command
1.1.4 PE none true umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
1.1.4 SE none false umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
master PE ValueError: read suffix must be /1 or /2. Observed: 1 N/A umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes
master SE ValueError: read suffix must be /1 or /2. Observed: 1 N/A umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes

so now both PE and SE are failing in the latest code.

to use the same data as me, you can grab them using the URLs:

wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz 
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_2.fastq.gz 
IanSudbery commented 1 year ago

Hi Anna,

Looks like I fixed this back in april and then never merged the fix :(. Sorry.

anoronh4 commented 1 year ago

ok great! i thought the fix was in master because #591 was merged already, and didn't realize any more updates had been planned. i tested out the unmerged branch, looks like it resolves the issue i was experiencing at least.