SorenKarst / longread_umi


'No such file or directory' & Guppy version #40

Closed LowYuKen closed 3 years ago

LowYuKen commented 3 years ago

Hi, Soren. Thank you for your research on long UMIs and for providing this code for use in my research. Unfortunately, some problems have occurred, and I have no solution for them even after seeking help from my PI.

The bases were called with Guppy version 4.2.3+8aca2af8, using the following command line:

guppy_basecaller -i fast5_pass -s test --cpu_threads_per_caller 8 --trim_strategy none --chunk_size 1500 --chunks_per_runner 1024 --flowcell FLO-MIN106 --kit SQK-LSK109

The resulting fastq files were combined and run with the following command line:

longread_umi nanopore_pipeline -d test.fastq -v 30 \
  -o test_longreads -s 120 -e 120 -m 1900 -M 2300 \
  -f CAAGCAGAAGACGGCATACGAGAT -F ATTCGCACTCCTCCAGCTTA \
  -r AATGATACGGCGACCACCGAGATC -R GTTGGCGAGAAAGTGAAAGC \
  -c 3 -p 1 -q r941_min_high_g330 -t 1 &

The process seemed fine at the beginning until it wasn't, and one peculiar thing caught my attention, shown as follows:

Seqs  8062
Clusters  8062
Max size  2
Avg size  1.0
Min size  1
Singletons  8061, 100.0% of seqs, 100.0% of clusters
Max mem  106Mb
Time  1.00s
Throughput  8062.0 seqs/sec.

This was a pilot run, so I used plasmid DNA for both UMI tagging and amplification. I expected all the sequences to be identical, apart from sequencing and/or PCR errors. It was surprising to find that all sequences ended up in different clusters.

The process went on and generated umi12c.fa, umi12p_map.sai, and umi12p.fa.

From here, a series of error messages occurred, shown as follows:

cat: test_longreads/umi_binning/umi_ref/umi_ref.fa: No such file or directory
...
[bwa_seq_open] fail to open file 'test_longreads/umi_binning/read_binning/umi_ref_b1.fa' : No such file or directory
[fread] Unexpected end of file
...
[bwa_seq_open] fail to open file 'test_longreads/umi_binning/read_binning/umi_ref_b2.fa' : No such file or directory
[fread] Unexpected end of file
...
[11:48:30 - DataIndex] No sample_registry in test_longreads/raconx3_medakax1/consensus/*_consensus.hdf
...

Thank you very much for your time and I do hope for a reply from you!

Yu Ken

SorenKarst commented 3 years ago

Dear Yu Ken,

Thank you for trying out the pipeline!

The step you are referring to is the part of the pipeline that tries to detect the correct UMI sequences from the error-prone raw nanopore data. Here we aim to find a number of clusters equal or close to the number of UMIs in the dataset.

The pipeline breaks on your data because the coverage of the individual UMIs is too low ("...Max size 2..."). Ideally, we would like an average cluster size of >10x. The most common explanation is that too many molecules were tagged during the UMI PCR, probably because too many template molecules were present.
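The coverage requirement above implies an upper bound on how many molecules a run can support. A minimal back-of-envelope sketch (the read count is a hypothetical example, not a pipeline parameter):

```shell
#!/usr/bin/env sh
# Rough capacity check: at a target average UMI coverage, how many tagged
# template molecules can a sequencing run of a given size support?
READS=1400000        # total reads in the run (example value)
TARGET_COV=10        # desired average reads per UMI bin (>10x recommended)
MAX_MOLECULES=$((READS / TARGET_COV))
echo "At ${TARGET_COV}x average coverage, tag at most ~${MAX_MOLECULES} molecules"
```

If the number of template molecules entering the UMI PCR greatly exceeds this bound, almost every UMI bin becomes a singleton, which is exactly the "Max size 2, Avg size 1.0" pattern seen above.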

How much template DNA did you use for the UMI PCR? How big is your plasmid? How big is your target amplicon?

All the best Søren

LowYuKen commented 3 years ago

Dear Soren, Thank you very much for your response!

I used approximately 5.75x10^6 molecules of initial DNA template for the UMI PCR. My plasmid is 8 kbp, but I've cut it with a restriction enzyme into 2.5 kbp + 5.5 kbp fragments. My target amplicon lies in the 2.5 kbp segment of the plasmid and is 2.2 kbp (including the UMI).

The data from my original post was an incomplete dataset (only 8062 sequences). We had sequenced approximately 5 GB of data in total. In the analysis of the full dataset, we got 1.4 million sequences and 1.3 million clusters, with a max cluster size of 2. This is where my PI realised too much DNA template had been used.

After rounds of discussion with my PI, we too came to the conclusion that there was too much initial DNA template, and we have done another round of PCR with 10^5 molecules of DNA template. We are now waiting for the Nanopore results and should have the data analysed in a week or two.
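The numbers in this thread make the dilution easy to sanity-check. A quick awk sketch (assuming, for illustration, that the second run yields a similar read count to the first):

```shell
# Average reads available per tagged molecule, before and after dilution.
# 1.4M reads is the pilot run's count; reusing it for the rerun is an assumption.
awk 'BEGIN {
  reads = 1400000
  printf "5.75e6 molecules: %.2f reads/molecule\n", reads / 5750000  # pilot run
  printf "1.15e5 molecules: %.2f reads/molecule\n", reads / 115000   # after dilution
}'
```

The pilot run had roughly 0.24 reads per molecule, far below the >10x target, while the diluted template lands around 12 reads per molecule, which is why the rerun can succeed.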

Much appreciation for your reply and for the effort you have put into this protocol!

Best of luck Yu Ken

LowYuKen commented 3 years ago

Hi Soren, After using less initial DNA template (1.15x10^5 molecules) and basecalling with an appropriate Guppy version (3.3.0), your pipeline worked wonderfully!!! Thank you very much!

I have a few questions regarding the output files for further analysis and understanding of this pipeline.

  1. What does the 'depth=383' in 'output_file/variants.fa' specifically mean? For example:

    >Cluster0_var1;depth=383
    GACCACCAAATGCCCCTATCTTATCAACACTTCCGGAAACTACTGTTGTTAGACGACGGGACCGAGGCAGGTcCcCtAGA...
    ...

  2. If I want to find out the read depth of each and every UMI barcode, which output file should I look into? I need to know how many UMI barcodes there were and, to a greater extent, how many individual molecules had been tagged, along with each of their read depths.

  3. What does the 'ubs=17' in 'output_file/consensus_raconx2_medakax2.fa' mean?

    >umi1000bins;ubs=17
    GACCACCAAATGCCCCTATCTTATCAACACTTCCGGAAAC

  4. In 'output_file/umi_binning/umi_ref/umi_ref.txt', the output format looks like this:

    umi39810  CAGCGCGCTGTAGTGGTC  AGACAAAACATACCGTCG  yes  umi39810/umi39810/umi39810/umi39810  5
    umi38030  TCTCATATTGTCGTGTAG  ACCTATAACACAACGAAA  yes  umi38030/umi38030/umi38030/umi38030  6
    ...

    Would you please kindly explain what each column means?

My master's thesis is currently based heavily on your pipeline, so I need a thorough understanding of it. Thank you for the time and effort you have put in!

Cheers! Yu Ken

SorenKarst commented 3 years ago

Hi Yu Ken,

That is great news!

To answer your questions:

  1. Depth is the number of UMI consensus sequences supporting a given variant. So in your example, 383 UMI consensus sequences were used to generate the variant Cluster0_var1.
  2. The number of reads for each UMI barcode (or UMI bin) can be found in the header of the consensus files as ubs=XXX. So umi1000bins;ubs=17 means 17 raw sequence reads were assigned to umi1000bins and used to generate the consensus sequence.
  3. See answer above.
  4. Column explanations:
    • Col1: UMI bin name
    • Col2: Sub UMI 1 found in one end of the read.
    • Col3: Sub UMI 2 found in the other end of the read. (Concatenated, sub UMIs 1 and 2 make up the full 36 bp UMI sequence.)
    • Col4: Result of the de novo chimera check. "yes" = not a chimera, keep the UMI bin; "no" = chimera, discard the UMI bin; "tandem" = same UMI in both ends, which is due to a very rare PCR artefact associated with the panhandle PCR principle.
    • Col5: Details from the de novo chimera check. In the chimera check, each sub UMI is matched against all other UMIs, both as the sequence is listed and reverse complemented. If a sub UMI from a specific UMI bin matches a sub UMI from another bin with higher absolute abundance, the specific UMI bin is classified as a chimera. The umi38030/umi38030/umi38030/umi38030 result lists, for each sub UMI in the bin, its best-matching sub UMI in the entire dataset. Because reverse complements are also checked, 4 tests are made in total, hence four results. A UMI bin passes if all the best matches are itself. Failing bins will have best hits from other UMI bins, which would look like this: umi38030/umi38030/umi38030/umi10000
    • Col6: Absolute abundance of a UMI bin, e.g. umi38030 is observed 6 times in the dataset. Note that this number of observations is relatively low, as it only counts the high-quality observations.
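The two answers above can be turned into a small post-processing sketch. The file paths follow this thread's directory layout and the output file name `umi_bin_depth.tsv` is a hypothetical choice:

```shell
#!/usr/bin/env sh
# Tabulate read depth per UMI bin by pulling "ubs=" out of each consensus
# fasta header. Header format assumed: >umi1000bins;ubs=17
awk -F'[>;=]' '/^>/ {print $2 "\t" $4}' \
  output_file/consensus_raconx2_medakax2.fa > umi_bin_depth.tsv

# Count the UMI bins that passed the de novo chimera check (column 4 == "yes").
awk '$4 == "yes"' output_file/umi_binning/umi_ref/umi_ref.txt | wc -l
```

The first command answers question 2 directly: the number of lines in `umi_bin_depth.tsv` is the number of UMI bins, and the second column is the read depth of each.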

Let me know if you have any other questions.

Regards, Søren

LowYuKen commented 3 years ago

Hi Soren,

Much appreciated for your clear explanation! I will let you know if any questions or issues occur.

Once again, I am eternally grateful for your work on this pipeline, and I look forward to it being more widely utilised!

Cheers! Yu Ken