SorenKarst / longread_umi

GNU General Public License v3.0

fatal error empty file ... #35

Open cliu32 opened 4 years ago

cliu32 commented 4 years ago

Hi Soren, thank you for developing the tool. It is easy to install, and I can run your test dataset successfully. However, when I run my own fastq file with the corresponding f/F/r/R sequences, I get a series of fatal errors (below). The same adapter sequences are used in this sample as in your paper. Thanks for troubleshooting!

[command] longread_umi nanopore_pipeline -d 34a.fastq -v 30 -o 34a_out -s 90 -e 90 -m 1000 -M 1800 -f CAAGCAGAAGACGGCATACGAGAT -F (forward primer) -r AATGATACGGCGACCACCGAGATC -R (reverse primer) -c 3 -p 1 -q r941_min_high_g330 -t 1

[from the log file] ...

=== Summary ===

Total reads processed:           5,544
Reads with adapters:             0 (0.0%)
Reads that were too short:       0 (0.0%)
Reads that were too long:        0 (0.0%)
Reads written (passing filters): 5,544 (100.0%)

Total basepairs processed: 7,063,877 bp
Total written (filtered):  7,063,877 bp (100.0%)

Scoring long reads
5,544 reads (7,063,877 bp)

Outputting passed long reads

local:0/10/100%/4.9s
Computers / CPU cores / Max jobs to run
1:local / 8 / 10
local:0/10/100%/4.9s

usearch v11.0.667_i86linux32, 4.0Gb RAM (32.6Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb   100.0% Reading 39a_out/umi_binning/umi_ref/umi12f.fa
00:00 73Mb   100.0% DF
00:00 73Mb  2203 seqs, 2203 uniques, 2203 singletons (100.0%)
00:00 73Mb  Min size 1, median 1, max 1, avg 1.00
00:00 73Mb   100.0% Writing 39a_out/umi_binning/umi_ref/umi12u.fa

usearch v11.0.667_i86linux32, 4.0Gb RAM (32.6Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb   100.0% Reading 39a_out/umi_binning/umi_ref/umi12u.fa
00:00 73Mb   100.0% DF
00:00 73Mb  2203 seqs (tot.size 2203), 2203 uniques, 2203 singletons (100.0%)
00:00 73Mb  Min size 1, median 1, max 1, avg 1.00
00:00 77Mb   100.0% DB
00:00 85Mb   100.0% 2202 clusters, max size 2, avg 1.0
00:00 85Mb   100.0% Writing centroids to 39a_out/umi_binning/umi_ref/umi12c.fa

      Seqs  2203
  Clusters  2202
  Max size  2
  Avg size  1.0
  Min size  1
Singletons  2201, 99.9% of seqs, 100.0% of clusters
   Max mem  85Mb
      Time  1.00s
Throughput  2203.0 seqs/sec.

[bwa_index] Pack FASTA... 0.00 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.03 seconds elapse.
[bwa_index] Update BWT... 0.00 sec
[bwa_index] Pack forward-only FASTA... 0.00 sec
[bwa_index] Construct SA from BWT and Occ... 0.01 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index 39a_out/umi_binning/umi_ref/umi12c.fa
[main] Real time: 0.086 sec; CPU: 0.052 sec
[bwa_aln_core] calculate SA coordinate... 0.25 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 28817 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: bwa aln -n 6 -t 1 -N 39a_out/umi_binning/umi_ref/umi12c.fa 39a_out/umi_binning/umi_ref/umi12p.fa
[main] Real time: 0.272 sec; CPU: 0.266 sec
[bwa_aln_core] convert to sequence coordinate... 0.01 sec
[bwa_aln_core] refine gapped alignments... 0.00 sec
[bwa_aln_core] print alignments... 0.01 sec
[bwa_aln_core] 28817 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: bwa samse -n 10000000 39a_out/umi_binning/umi_ref/umi12c.fa 39a_out/umi_binning/umi_ref/umi12p_map.sai 39a_out/umi_binning/umi_ref/umi12p.fa
[main] Real time: 0.043 sec; CPU: 0.033 sec
cat: 39a_out/umi_binning/umi_ref/umi_ref.fa: No such file or directory
[E::stk_seq] failed to open the input file/stream.
[bwa_index] Pack FASTA... 0.05 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.10 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.06 sec
[bwa_index] Construct SA from BWT and Occ... 0.34 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index 39a_out/umi_binning/read_binning/reads_tf_umi1.fa
[main] Real time: 1.629 sec; CPU: 1.577 sec
[bwa_index] Pack FASTA... 0.06 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 1.17 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.05 sec
[bwa_index] Construct SA from BWT and Occ... 0.35 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index 39a_out/umi_binning/read_binning/reads_tf_umi2.fa
[main] Real time: 1.698 sec; CPU: 1.647 sec
[bwa_seq_open] fail to open file '39a_out/umi_binning/read_binning/umi_ref_b1.fa' : No such file or directory
[fread] Unexpected end of file
[bwa_seq_open] fail to open file '39a_out/umi_binning/read_binning/umi_ref_b2.fa' : No such file or directory
[fread] Unexpected end of file
[09:07:06] UMI match filtering...
[09:07:06] Read orientation filtering...
[09:07:06] UMI match error filtering...
[09:07:06] UMI bin/cluster size ratio filtering...
[09:07:06] Print UMI matches...
[09:07:06] Done.

Computers / CPU cores / Max jobs to run
1:local / 8 / 10

Computers / CPU cores / Max jobs to run
1:local / 8 / 1

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/1/100%/0.0s
local:0/1/100%/0.0s
Computers / CPU cores / Max jobs to run
1:local / 8 / 1
local:0/1/100%/0.0s
[09:07:12 - DataIndex] No sample_registry in 39a_out/raconx3_medakax1/consensus/_consensus.hdf
Traceback (most recent call last):
  File "/home/cliu/anaconda3/envs/longread_umi/bin/medaka", line 11, in <module>
    sys.exit(main())
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/medaka/medaka.py", line 532, in main
    args.func(args)
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/medaka/stitch.py", line 125, in stitch
    index = medaka.datastore.DataIndex(args.inputs)
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/medaka/datastore.py", line 206, in __init__
    self.metadata = self._load_metadata()
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/medaka/datastore.py", line 244, in _load_metadata
    with DataStore(first_file) as ds:
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/medaka/datastore.py", line 39, in __init__
    self.fh = h5py.File(self.filename, self.mode)
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/h5py/_hl/files.py", line 269, in __init__
    fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
  File "/home/cliu/anaconda3/envs/longread_umi/lib/python3.6/site-packages/h5py/_hl/files.py", line 99, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 78, in h5py.h5f.open
OSError: Unable to open file (unable to open file: name = '39a_out/raconx3_medakax1/consensus/_consensus.hdf', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)
sed: can't read 39a_out/raconx3_medakax1/consensus_raconx3_medakax1.fa: No such file or directory
gawk: fatal: cannot open file `39a_out/consensus_raconx3_medakax1.fa' for reading (No such file or directory)

usearch v11.0.667_i86linux32, 4.0Gb RAM (32.6Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

usearch -fastx_uniques 39a_out/variants/m_temp.fa -strand both -fastaout 39a_out/variants/u_temp.fa -uc 39a_out/variants/u_temp.uc -sizeout

---Fatal error---
Empty file 39a_out/variants/m_temp.fa

usearch v11.0.667_i86linux32, 4.0Gb RAM (32.6Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

usearch -cluster_fast 39a_out/variants/u_temp.fa -id 0.995 -strand both -centroids 39a_out/variants/c1_temp.fa -uc 39a_out/variants/c1_temp.uc -sort length -sizeout -sizein

---Fatal error---
Cannot open 39a_out/variants/u_temp.fa, errno=2 No such file or directory

usearch v11.0.667_i86linux32, 4.0Gb RAM (32.6Gb total), 8 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

usearch -cluster_fast 39a_out/variants/c1_temp.fa -id 0.995 -strand both -centroids 39a_out/variants/c2_temp.fa -uc 39a_out/variants/c2_temp.uc -sort length -sizein -sizeout

---Fatal error---
Cannot open 39a_out/variants/c1_temp.fa, errno=2 No such file or directory
gawk: fatal: cannot open file `39a_out/variants/u_temp.uc' for reading (No such file or directory)
cat: 39a_out/variants/centroids.fa: No such file or directory

Computers / CPU cores / Max jobs to run
1:local / 8 / 10

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
local:0/10/100%/0.0s
cat: '39a_out/variants/phasing_consensus//variant.fa': No such file or directory

MaestSi commented 4 years ago

Hi, I am having the same error. Were you able to solve it? Thanks, Simone

cliu32 commented 4 years ago

Do you also see very little clustering going on? I got Seqs 2203, Clusters 2202. I could not solve it and wish for some input from the author.
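A quick way to see how much UMI duplication a run actually produced is to tally the `;size=N` annotations that usearch writes with -sizeout. This is a hypothetical diagnostic, not part of the pipeline; the example file below stands in for the real umi12c.fa in the output directory. A healthy run shows many sizes >= 3, while a near-all-size-1 distribution (like 2203 seqs / 2202 clusters above) means almost every UMI was seen only once.

```shell
# Tally UMI cluster sizes from a usearch centroid FASTA with ";size=N"
# annotations. Stand-in example file; point this at your own umi12c.fa.
set -eu

cat > umi12c_example.fa <<'EOF'
>umi1;size=1
ACGT
>umi2;size=4
ACGA
>umi3;size=1
ACGC
EOF

# Output: one line per distinct size, "count size".
grep -o 'size=[0-9]*' umi12c_example.fa \
  | cut -d= -f2 \
  | sort -n \
  | uniq -c
```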

thekatpod commented 4 years ago

Thanks for the great software, I'm excited to analyze my data but I'm also getting the same empty file error:
cat: umi_out/umi_binning/umi_ref/umi_ref.fa: No such file or directory
Has anyone solved this yet? It's really frustrating.

MaestSi commented 4 years ago

I have done some troubleshooting on our data (100k reads), running the software chunk by chunk, and I observed that the problem was that few reads were assigned to UMIs with size >= 3. In fact, at line 260 of the umi_binning.sh script (if I understood it correctly), after mapping all candidate UMIs (umi12p.fa) to the putative reference UMIs (umi12c.fa), only reference UMIs with coverage of at least 3 are retained in the file umi12cf.fa. If no UMIs survive this filtering step (and the subsequent chimera filtering step), you may end up with an empty UMI reference database.

So this was not a problem with reading the UMI sequence itself, but with the low number of PCR duplicates in our experiment, possibly caused by too low sequencing depth relative to the amount of DNA we used. I suspect all of these errors can be explained by the same reasoning. Another parameter that helped with our data was setting -v 3, but I don't think it will solve the empty reference database error. What do you think? Simone
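The filtering idea described above can be sketched as a standalone snippet. This is NOT the pipeline's actual code (umi_binning.sh filters on mapping coverage from the bwa step); it is a minimal stand-in that drops FASTA records whose `;size=N` annotation falls below a threshold, just to illustrate how an empty reference database can arise when no UMI reaches the minimum bin size.

```shell
# Keep only FASTA records whose ";size=N" header annotation is >= MIN_SIZE.
# With MIN_SIZE=3 and mostly-singleton UMIs, the output can be empty,
# which mirrors the "empty UMI reference database" failure in this thread.
set -eu
MIN_SIZE=3

# Hypothetical example input (replace with your own sized FASTA).
cat > umi_sized_example.fa <<'EOF'
>umiA;size=5
ACGTACGT
>umiB;size=1
TTGGCCAA
EOF

awk -v min="$MIN_SIZE" '
  /^>/ { keep = 0
         if (match($0, /size=[0-9]+/))
           keep = (substr($0, RSTART + 5, RLENGTH - 5) + 0 >= min) }
  keep' umi_sized_example.fa
```

Here only umiA survives; if every record were a singleton, the output (the "reference database") would be empty, and every downstream step would then fail with the "No such file or directory" / "Empty file" errors shown above.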

SorenKarst commented 4 years ago

Hi everyone,

Sorry about the late reply.

The pipeline is not very debug-friendly and right now requires an intimate understanding of each step. Hopefully I will have time this summer to add proper checks and terminal messages to remedy this.

As MaestSi points out, the problem is most likely a sub-optimal ratio between the number of tagged molecules and the amount of data generated. From my ONT R9.4.1 experiments, it seems I need a per-molecule coverage of >15x for the pipeline to properly detect the molecule's UMI. This means any molecule with <15x coverage is not "detected" or processed at all.

In cliu32's case I am quite confident the molecule/data ratio is the problem. With 5,544 reads you would need <370 tagged molecules (5544/15) in your sample to successfully produce UMI consensus sequences. cliu32 detects 2,203 UMI sequences, which are clustered into 2,202 unique UMI clusters. This indicates there are far more than 370 tagged molecules in the sample (my guess would be up towards 100,000), and hence the method breaks.
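The arithmetic above can be written out as a quick shell check (read count from this thread; the 15x threshold is the empirical estimate stated above):

```shell
# Estimate the maximum number of tagged molecules a run can support,
# given the total read count and a minimum per-molecule coverage.
set -eu
READS=5544     # total reads in cliu32's run
MIN_COV=15     # empirical minimum reads per molecule for UMI detection

MAX_MOLECULES=$((READS / MIN_COV))
echo "At ${MIN_COV}x per molecule, ${READS} reads support at most ${MAX_MOLECULES} molecules"
# -> at most 369 molecules, far below the 2203 UMIs actually observed
```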

The interesting question is then: how can we ensure we start with the correct number of tagged molecules? This can be difficult to determine, as copy number, DNA integrity, and primer efficiency will affect it differently for different targets. We have used test sequencing and semi-quantitative PCR to estimate the number of templates in our samples.

If you want to get started with the method quickly, I would recommend starting from an amplicon of your target and using that as input. With an amplicon template it is easy to dilute to the desired number of templates for your purpose. A word of caution: we have had problems generating PCR products from <1,000 molecules, but this is probably a PCR optimization issue. The downside of this approach is that PCR chimeras and errors are not removed, but you will still get very high-quality amplicons.
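To dilute an amplicon stock to a target template count, the standard dsDNA copy-number formula (~660 g/mol per base pair, times Avogadro's number) can be used. A sketch with made-up example numbers (concentration, length, and target are placeholders, not values from this thread):

```shell
# Copies/uL of a dsDNA amplicon stock, and the dilution factor needed
# to reach a target template count. Uses awk for floating-point math.
set -eu
CONC_NG_UL=10        # measured amplicon concentration (ng/uL), example value
AMPLICON_BP=1500     # amplicon length (bp), example value
TARGET_COPIES=10000  # desired template copies per uL after dilution

awk -v c="$CONC_NG_UL" -v l="$AMPLICON_BP" -v t="$TARGET_COPIES" 'BEGIN {
  # copies/uL = (mass in g) / (molar mass) * Avogadro
  copies_per_ul = (c * 1e-9 * 6.022e23) / (l * 660)
  printf "stock: %.3g copies/uL\n", copies_per_ul
  printf "dilute %.3gx to reach %g copies/uL\n", copies_per_ul / t, t
}'
```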

mitja-remus-emsermann commented 3 years ago

Hi Soren,

Thanks again for the tool - I am still excited about the idea of whole ribosomal operons in microbial ecology.

I believe I have finally installed your pipeline correctly (on Ubuntu 20.04 and on Ubuntu 16.04 running in a virtual machine). Currently I am testing the pipeline with the small nanopore test dataset and the settings suggested in the readme (copy-pasted into the terminal). Similar to cliu32, I am facing an issue where the pipeline does not run through and finishes with empty files and error messages along the way: variants.fa and consensus_raconx3_medakax1_3.fa are empty.

This line in the logs is especially suspicious:
environment: line 12: 18110 Illegal instruction (core dumped) $RACON -t 1 -m 8 -x -6 -g -8 -w 500 $RACON_ARG $RB $OUT/ovlp.paf $OUT/${UMINO}_sr.fa > $OUT/${UMINO}_tmp.fa
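One common cause of "Illegal instruction (core dumped)" is a binary built with SIMD instructions that the CPU, or the virtual machine it runs in, does not expose. A quick Linux-only check of commonly required flags (the flag list here is an assumption for illustration, not taken from the racon documentation):

```shell
# Check whether the CPU advertises common SIMD instruction sets.
# A VM can hide flags the racon binary was compiled against, which
# then crashes with "Illegal instruction" at runtime.
set -eu
for flag in sse4_1 sse4_2 avx avx2; do
  if grep -qw "$flag" /proc/cpuinfo; then
    echo "$flag: present"
  else
    echo "$flag: MISSING"
  fi
done
```

If a required flag is missing inside the VM, enabling host CPU passthrough or rebuilding racon from source on that machine are possible workarounds.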

I have attached my logs. longread_umi_nanopore_pipeline_log_2021-02-17-16:32:26.txt

Thanks in advance, Mitja

MaestSi commented 3 years ago

Hi, I had a similar issue with racon v1.10, as described here. I solved it by installing a newer racon version, e.g. v1.13. Once you have activated the environment, you could try: conda update racon. Let me know if it works! Simone

mitja-remus-emsermann commented 3 years ago

Hi Simone,

thanks for the tip! I have updated racon (which worked like a charm), but I still end up with the same errors and empty files.

I have checked the folders and files that were generated. The UMI binning and trimming seem to have worked, and the files are populated with data. I also found a populated *.SAM file in /test_941/umi_binning/umi_ref.

I believe you are right that the error has to do with the racon polishing, though looking through the options of the command that triggers the illegal instruction, they seem perfectly fine.

I am worried that I am making some kind of silly mistake. Besides downloading the git repository, installing conda and then longread_umi, and getting usearch to run, there is nothing else to install, is there?

Thanks, Mitja