SorenKarst / longread_umi

GNU General Public License v3.0
76 stars 29 forks source link

Empty file umi12f.fa when when running usearch in longread_umi nanopore_pipeline #53

Open mortenry opened 2 years ago

mortenry commented 2 years ago

Hi

I am trying to run the longread_umi nanopore_pipeline with ONP data from a collaborator. However, after the adaptor/primer trimming step, the following error appears when the pipeline runs the usearch command:

usearch11.0.667_i86linux32 -fastx_uniques test_mb0/umi_binning/umi_ref/umi12f.fa -fastaout test_mb0/umi_binning/umi_ref/umi12u.fa -sizeout -minuniquesize 1 -relabel umi -strand both

---Fatal error--- Empty file test_mb0/umi_binning/umi_ref/umi12f.fa usearch v11.0.667_i86linux32, 4.0Gb RAM (396Gb total), 72 cores (C) Copyright 2013-18 Robert C. Edgar, all rights reserved. https://drive5.com/usearch

Apparantly the umi12f.fa file is empty, and the pipeline fails to proceed.

I cannot find out why this is happening (or why the file is empty). The pipeline worked fine on the test data (both R9.4.1 and R10).

I have attached the output from the pipeline before the usearch command. I would highly appreciate any suggestions, and please let me know if you need more information.

best regards Morten Rye

Program_Output_umi12f-empty.txt

rob234king commented 2 years ago

What your output of perfect_trim.log? I have below so not sure as used fast model that too many errors in my reads. But resulting files umi1.fq is empty? Seems to be the cutadapt command that is failing for me. I'm guessing the same as me?

Total read pairs processed: 16,766 Read 1 with adapter: 74 (0.4%) Read 2 with adapter: 63 (0.4%) Pairs that were too short: 127 (0.8%) Pairs that were too long: 16,639 (99.2%) Pairs written (passing filters): 0 (0.0%)

Total basepairs processed: 6,706,400 bp Read 1: 3,353,200 bp Read 2: 3,353,200 bp Total written (filtered): 0 bp (0.0%) Read 1: 0 bp Read 2: 0 bp

rob234king commented 2 years ago

looks like some too long filter, so maybe changing --maximum-length parameter in cutadapt?

rob234king commented 2 years ago

putative trim did better but lost a lot. This output should be overwritten by the perfect trim which is why the file is empty as perfect trim fails. Total read pairs processed: 16,766 Read 1 with adapter: 13,114 (78.2%) Read 2 with adapter: 7,593 (45.3%) Pairs that were too short: 134 (0.8%) Pairs written (passing filters): 5,889 (35.1%)

Total basepairs processed: 6,706,400 bp Read 1: 3,353,200 bp Read 2: 3,353,200 bp Total written (filtered): 212,004 bp (3.2%) Read 1: 106,002 bp Read 2: 106,002 bp

rob234king commented 2 years ago

relaxing e value and max length and min length and more comes through the filter

rob234king commented 2 years ago

This is what I have so maybe a mistake here. Looks like cutadapt process is not working as intended, likely due to my misunderstanding of something.

Adapter number adapter Top strand bottom strand Top strand sequence Bottom Strand sequence
PCA 2.05 13xN ONLA30677 ONLA30678 TTTCTGTTGGTGCTGATATTGCNNNNNNNNNNNNNGGCGTCTGCTTGGGTGTTTAACCT /5phos/GGTTAAACACCCAAGCAGACGCCNNNNNNNNNNNNNGAAGATAGAGCGACAGGCAAGT

-f GGCGTCTGCTTGGGTGTTTAACCT \ -F TTTCTGTTGGTGCTGATATTGC \ -r GGTTAAACACCCAAGCAGACGCC \ -R GAAGATAGAGCGACAGGCAAGT \

mortenry commented 2 years ago

Thanks for input. Good idea to check "perfect_trim.log". It looks like this:

This is cutadapt 2.7 with Python 3.6.10 Command line parameters: -j 1 -e 0.2 -O 11 -m 18 -M 18 --discard-untrimmed -g GTATCGTGTAGAGACTGCGTAGG...TCAAGCCTCAGACAGTGGTTC -g AGTGATCGAGTCAGTGCGAGTG...CATAGCGTAAAAGGAGCAACA -G TGTTGCTCCTTTTACGCTA TG...CACTCGCACTGACTCGATCACT -G GAACCACTGTCTGAGGCTTGA...CCTACGCAGTCTCTACACGATAC -o test_mb0/umi_binning/umi_ref/umi1.fq -p test_mb0/umi_binning/umi_ref/umi2.fq test_mb0/umi_binning/umi_ref/reads_tf_s tart.fq test_mb0/umi_binning/umi_ref/reads_tf_end.fq Processing reads on 1 core in paired-end mode ... Finished in 0.21 s (67 us/read; 0.89 M reads/minute).

=== Summary ===

Total read pairs processed: 3,110 Read 1 with adapter: 2,036 (65.5%) Read 2 with adapter: 1,557 (50.1%) Pairs that were too short: 17 (0.5%) Pairs that were too long: 3,093 (99.5%) Pairs written (passing filters): 0 (0.0%)

Total basepairs processed: 1,217,910 bp Read 1: 608,955 bp Read 2: 608,955 bp Total written (filtered): 0 bp (0.0%) Read 1: 0 bp Read 2: 0 bp

So it seems to be the problem that no reads pass the filter. However this does not change, even when I change to -m 10, -M 10000 and -e 10. It's also interesting to see the cutadapt commandline at the top where -m and -M are set to 18, but maybe it is supposed to be like this, since the UMI-length between the adapter/primer is 18. I don't know when the input -m and -M are used, so I am still confused.

Run command: longread_umi nanopore_pipeline \ -d fastq_runid_25dec2b6d463d9adc700c13264282f7c49ac4b50_0_0.fastq \ -o test_mb0 \ -v 25 \ -q r10_min_high_g340 \ -m 10 \ -M 100000 \ -s 100 \ -e 10 \ -f GTATCGTGTAGAGACTGCGTAGG \ -F TCAAGCCTCAGACAGTGGTTC \ -r AGTGATCGAGTCAGTGCGAGTG \ -R CATAGCGTAAAAGGAGCAACA \ -c 2 \ -p 2 \ -t 1 \ -T 1

Morten

rob234king commented 2 years ago

you can not have an e value greater than 1 so would need to be 0.9 and should see them come through but this means 90% error on match so not a solution but something is not right with the cutadapt command. Either primer/adapter combinations are not right or something. Did you have UMI at one end or both?

mortenry commented 2 years ago

I understand now that you meant the e value in cutadapt and not the -e in the "longread_umi nanopore_pipeline" setup. The e in cutadapt cannot be specified as input (looks like defualt is 0.2 in the pipeline). I got the primer/adapter inputs for the workflow from my collaborators. The excel-file they used to define them these look like this:

Oligo Name Sequence 5' to 3' (include modification codes if applicable) Scale (µmole) Purification
GSP UMI fwd GTATCGTGTAGAGACTGCGTAGGTTTVVVVTTVVVVTTVVVVTTVVVVTTTtcaagcctcagacagtggttc 0.05 DST
GSP UMI rev AGTGATCGAGTCAGTGCGAGTGTTTVVVVTTVVVVTTVVVVTTVVVVTTTCATAGCGTAAAAGGAGCAACA 0.05 DST
UVP fwd GGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCGTATCGTGTAGAGACTGCGTAGG 0.05 DST
UVP rev GGTGCTGAAGAAAGTTGTCGGTGTCTTTGTGTTAACCAGTGATCGAGTCAGTGCGAGTG 0.05 DST

I am not an expert on the structures, but there is of course a possibility that something has been defined wrongly.

Morten