Closed anoronh4 closed 1 year ago
Hmmm.. That's strange and I can't reproduce with umi_tools 1.1.3
. From the tests directory, if I run the following for paired end and then single end extraction.
umi_tools extract --extract-method=string --read2-in=scrb_seq_fastq.2.gz --bc-pattern=NNNNNNNNNN -L test.log --stdin=scrb_seq_fastq.1.gz --read2-out=delete_me.2.fastq > delete_me.1.fastq
umi_tools extract --extract-method=string --bc-pattern=NNNNNNNNNN -L test.log --stdin=scrb_seq_fastq.1.gz > delete_me.fastq
The headers have the expected format in both cases:
==> delete_me.1.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
==> delete_me.2.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=34
==> delete_me.fastq <==
@SRR1058032.1_AATAACTTCC HISEQ:653:H12WDADXX:1:1101:1210:2217 length=17
Could you please try updating to 1.1.3
to see if this resolves the issue. There was a small update to allow a user-defined umi separator in the output (https://github.com/CGATOxford/UMI-tools/pull/548), so it's possible the behaviour has changed, but our interated regression testing didn't pick that up if so 🤷
Sorry for not seeing your reply earlier. I continue to see the issue with both 1.1.3 and 1.1.4. my command is
umi_tools extract \
-I test_rnaseq_1.fastq.gz \
-S SAMPLE_SINGLE_END_UMI@SRR5665260.1.36873006@@SAMPLE_SINGLE_END_UMI_T1.umi_extract.fastq.gz \
--bc-pattern="NNN" --ignore-read-pair-suffixes
You can grab this same input to test here: https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz . i think in the example you used the header might have been formatted differently than in mine -- there is no flowcell information in the header in my case.
Thanks for the example 😁 I get the same output.
While I look into it, you can try using the --readNameSeparator
option with STAR. The manual states:
default: /
string(s): character(s) separating the part of the read names that will be
trimmed in output (read name after space is always trimmed)```
So setting that to a space if possible, or some other character that shouldn't exist (e.g ':') should work as an immediate solution.
umi_tools
needs to ensure that the read names for paired reads are the same after adding the UMI or else downstream tools will throw errors. The information after the first space in field 1 of a fastq entry is not considered part of the read name as is optional (fastq format). Hence, umi_tools
adds the UMIs to the end of the first part of field 1.
I think I'm right in saying there is no strict convention for denoting read 1/2 in field 1. Usually it's e.g /1
at the very end of the field with at least one space in between the read name and the read number. However, sometimes, field 1 can contain the read number at the end of the read name. This is why star
removes everything after the '/' and why umi_tools
allows the user to ignore the read suffix when checking that the two reads have the same name (--ignore-read-pair-suffixes
; https://github.com/CGATOxford/UMI-tools/issues/325). This option will also remove the read pair suffixes (/1
/ /2
) from the read name to ensure they are identical.
In your case, you have a paired end read file with the read numbers as part of the read name. As you show above, this is handled appropriately with --ignore-read-pair-suffixes
when processing both reads, but not when processing just one read. The obvious question is why do you want to use just one read? Given that you do though, it would be possible to make umi_tools
remove the read pair suffix in single end mode as well.
@IanSudbery, would just need to add the equivalent of https://github.com/CGATOxford/UMI-tools/blob/d98ebace3a096bdd06dbe2d4feb0388ccf13d0b1/umi_tools/umi_methods.py#L119-L132 to the single end fastq parser. Wouldn't add any overhead when the option is not used and mimimal when used. What do you think?
Not quite that simple. There isn't a single end fastq parser that is the equivalent of the joinedFastqParser. This is presumably where the problem is coming from. The structure here is that the joinedParser is called with two fastq iteratrors as parameters. In the single-end case there is no parser, just the raw iterator.
I don't think there should be any problem with moving the suffix code to the iterator. Except that I'll need to make sure it always gets called correctly.
Ah, yeah, I was scanning the code too lightly. Nice work on the PR 👍
i installed the latest updates including this PR and unfortunately it errored out.
$ umi_tools extract -I test_rnaseq_1.fastq.gz -S extract.fastq.gz --bc-pattern="NNN" --ignore-read-pair-suffixes
# UMI-tools version: 1.1.4
# output generated by extract -I test_rnaseq_1.fastq.gz -S extract.fastq.gz --bc-pattern=NNN --ignore-read-pair-suffixes
# job started at Thu Apr 13 12:19:49 2023 on terra -- 786c956e-8418-4a4a-a3e9-493934d944b7
# pid: 15240, system: Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64
# blacklist : None
# compresslevel : 6
# correct_umi_threshold : 0
# either_read : False
# either_read_resolve : discard
# error_correct_cell : False
# extract_method : string
# filter_cell_barcode : None
# filter_cell_barcodes : False
# filter_umi : None
# filtered_out : None
# filtered_out2 : None
# ignore_suffix : True
# log2stderr : False
# loglevel : 1
# pattern : NNN
# pattern2 : None
# prime3 : None
# quality_encoding : None
# quality_filter_mask : None
# quality_filter_threshold : None
# random_seed : None
# read2_in : None
# read2_out : False
# read2_stdout : False
# reads_subset : None
# reconcile : False
# retain_umi : None
# short_help : None
# stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin : <_io.TextIOWrapper name='test_rnaseq_1.fastq.gz' encoding='ascii'>
# stdlog : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
# stdout : <_io.TextIOWrapper name='extract.fastq.gz' encoding='ascii'>
# timeit_file : None
# timeit_header : None
# timeit_name : all
# tmpdir : None
# umi_correct_log : None
# umi_separator : _
# umi_whitelist : None
# umi_whitelist_paired : None
# whitelist : None
2023-04-13 12:19:49,025 INFO Starting barcode extraction
Traceback (most recent call last):
File "/home/noronhaa/anaconda3/envs/umitools_test/bin/umi_tools", line 33, in <module>
sys.exit(load_entry_point('umi-tools==1.1.4', 'console_scripts', 'umi_tools')())
File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_tools.py", line 66, in main
module.main(sys.argv)
File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/extract.py", line 439, in main
for read in read1s:
File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 116, in fastqIterate
line1 = removeReadIDSuffix(line1)
File "/home/noronhaa/anaconda3/envs/umitools_test/lib/python3.6/site-packages/umi_tools-1.1.4-py3.6-linux-x86_64.egg/umi_tools/umi_methods.py", line 92, in removeReadIDSuffix
'read suffix must be /1 or /2. Observed: %s' % read_id[-2:])
ValueError: read suffix must be /1 or /2. Observed: 1
is there any update here? i've tested out the master branch by installing from a clone into my conda environment (python setup.py install
) and then tried to compare the behaviors of the two. seems like there's a problem now with paired end fastqs as well. here's a breakdown of what i see:
version | type | error | correct header | command |
---|---|---|---|---|
1.1.4 | PE | none | true | umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes |
1.1.4 | SE | none | false | umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes |
master | PE | ValueError: read suffix must be /1 or /2. Observed: 1 | N/A | umi_tools extract -I test_rnaseq_1.fastq.gz --read2-in=test_rnaseq_2.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --read2-out=test_rnaseq_2.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes |
master | SE | ValueError: read suffix must be /1 or /2. Observed: 1 | N/A | umi_tools extract -I test_rnaseq_1.fastq.gz -S test_rnaseq_1.umi_extract.fastq.gz --bc-pattern="NNNXX" --ignore-read-pair-suffixes |
so now both PE and SE are failing in the latest code.
to use the same data as me, you can grab them using the URLs:
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/modules/data/genomics/homo_sapiens/illumina/fastq/test_rnaseq_2.fastq.gz
Hi Anna,
Looks like I fixed this back in april and then never merged the fix :(. Sorry.
ok great! i thought the fix was in master because #591 was merged already, and didn't realize any more updates had been planned. i tested out the unmerged branch, looks like it resolves the issue i was experiencing at least.
I have noticed inconsistent behavior in how
umi_tools extract
handles the /1 suffix depending on whether the inputs are paired end or single end. when the input is paired-end read id is changed from@SRR5665260.1.36873006/1
to@SRR5665260.1.36873006_CAG
, basically removing the read number as it separates them into two distinct files. If I take only the read1 fastq file and runumi_tools extract
, the same read becomes@SRR5665260.1.36873006/1_CAG
, even though i am using--ignore-read-pair-suffixes
in both cases. I am using v1.1.2.This is causing issues downstream for me when i run alignment with STAR. STAR discards everything after
/
in the read ID. just wondering if this can be fixed on theumi_tools
end.