MaestSi / UMInator

A Nextflow pipeline for generating consensus sequences from Nanopore reads tagged with UMIs
GNU General Public License v3.0
5 stars 2 forks source link

Error when UMI_db_p1.sam contains an empty alignment #3

Closed jayyu-0528 closed 4 months ago

jayyu-0528 commented 4 months ago

Hello Simone,

After successfully processing the Zymo test data, I am now trying to run the UMInator pipeline on the nanopore data generated from my lab's MinION Mk1B, using the most updated MinKNOW and basecalling software (R10.4.1, 400 bps, 5 kHz, SUP model).

I tried to start the analysis with a small subset of my data. To do this, I combined the first 10 fastq files generated from the sequencing run, resulting in a single file with 120 MB, containing 22176 reads.

My UMI is slightly different from the original 18-bp design used by Karst et al. The only difference is that I added a 5-bp sample-specific barcode to the original design, so the whole region before the read is: [adaptor]-[5-bp sample-specific barcode]-[NNNYRNNNYRNNNYRNNN]-[homology primer]

For example, my first sample was PCR-amplified with the primer above, with a known barcode TGACA. Sample 2 has a known barcode GGTTC, and so forth. This barcode was added, so that I could pool multiple samples into a single sequencing run. Matching this barcode to my record should allow me to tell which read comes from which starting sample.

I then run the UMInator pipeline, with no change to the UMInator.conf file, and the following call: (Note that I have not updated the Medaka model yet; I have not been able to get to the polishing stage due to the upcoming error.)

image

An error was thrown at the readsUMIsAssignment process:

Error message, original

To help debugging, I added a few print commands into the FilterUMIs.R script, which is throwing the error, and re-run the pipeline:

Added commands to print out suspected variables

Error message, with  suspected variables printed to console

It seems like this error occurs whenever an alignment inside UMI_db_p1.sam does not contain sec_al, which leads to NM_sd to be NA, and thus the conditional inside if (NM_mean < max_NM_mean && NM_sd < max_NM_sd) is neither true or false.

I used nano to open UMI_db_p1.sam, and found a pattern: All "correct" alignments, such as Alignment # 1 - #85, have a lot of actual alignments after the string "MD:Z:23", such as shown in this screenshot (This is Alignment #85). Alignment #85 has many actual alignments

The error-causing alignment, such as Alignment #86, has no actual alignment after the string "MD:Z:23". Alignment #86 has no alignment

Alignment #86 is not unique. Scrolling down the file, such "no-alignment" entries occur occasionally, taking up maybe 1-5% of all entries that I browsed through.

I am currently reading into the UMInator scripts, and BWA documentation, to try to figure out why such "no-alignment" entry was generated in the UMI_db_p1.sam and UMI_db_p2.sam files. If you happen to know the cause of these entries, and/or ways to correctly handle this error, I will love to learn about them. Thanks again!

Best, Jay

MaestSi commented 4 months ago

Hi, I think the issue should be solved by setting NM_sd variable to an arbitrarily high value in case there are not secondary alignments. I edited line 45 of Filter_UMIs.R script to: NM_sd <- ifelse(is.na(sd(c(prim_al_NM, sec_al_NM))), 100000, sd(c(prim_al_NM, sec_al_NM))) I think another issue may be due to the fact that at lines 110-117 of UMInator.nf script:

#search candidate UMIs with exact length between adapters and primers
    cutadapt -j ${task.cpus} -e ${params.tolCutadaptErr} -O ${params.minLenOvlp} -m ${params.UMILen} -M ${params.UMILen} \
    --discard-untrimmed --match-read-wildcards \
    -g ${params.FW_adapter}...${params.FW_primer} -g ${params.RV_adapter}...${params.RV_primer} \
    -G \$RV_primer_R...\$RV_adapter_R -G \$FW_primer_R...\$FW_adapter_R \
    -o ${params.results_dir}/candidateUMIsExtraction/${sample}/UMI_part1_db_tmp1.fastq \
    -p ${params.results_dir}/candidateUMIsExtraction/${sample}/UMI_part2_db_tmp1.fastq \
    \$READS_START \$READS_END

UMIs with the expected sequence length, placed between the adapter and the primer are searched for. Probably, given your slightly different design compared to the one by Karst et al. (the sample-specific barcode is not more "external" than the adapter), you should keep --UMILen=18 and incorporate the sample-specific barcode in the adapter sequence. If this works, you will probably need to run the pipeline as many time as the number of samples. By the way, random downsampling for testing the pipeline is usually not very effective. In fact, randomly downsampling a few reads you are very likely to end up with few/no PCR duplicates. Conversely, you should try running the pipeline (e.g. for 1 sample) with the full dataset, and only a posteriori select a subset of reads assigned to a number of choice of UMIs (e.g. 1000 UMIs). Let me know if this helps solving the issue! Best, SM

jayyu-0528 commented 4 months ago

Hi Simone,

Thanks so much! The edit resolves the error. I am now able to complete the entire pipeline on my data (both down-sampled and full)!

The suggestion are very helpful. Running the pipeline on the entire dataset results in a lot more reads assigned to each UMI, likely minimizing the occurrences of the no-secondary-alignment issue described above.

So far, I have been running with the modified UMI Pattern and UMILen = 23. I will next try to run with your other suggestion, i.e., by including the sample-specific 5-bp barcode into the adaptor, instead of into the UMI, and see how that affect the output. Will update below if I see a big difference.

Thanks again for sharing a great software!

Best, Jay