minimap error - Githubissues

aheravi commented 3 years ago

Hi, I am trying to use ampBinner_10X.py on my fastq and getting errors on the alignment step. Could you please comment on that?

Thanks!

Running code:

ampBinner_10X.py --in_fq P.fastq.gz --barcode_list 3M_Feb_737_Apr_Aug.txt --barcode_upstream_seq AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT --out_prefix D8 --num_threads 24 --minimap2 /linux-x86_64-centos7/minimap2-2.15/minimap2 >run.log 2>&1 &

Error:

sh: line 1: 51263 Killed                  //minimap2-2.15/minimap2 -t 1 --for-only --eqx -c --cs -N 200 -k 5 -w 3 -n 1 -m 10 -s 40 -A 4 -x map-ont /D8.tmp.thread16.barcode_with_anchor.fasta /D8.tmp.thread16.left_tail_barcode_candidate.fastq > /D8.tmp.thread16.left_tail_barcode_compare.paf 2> /dev/null
[01/06/2021 12:48:49] ERROR: Failed to run command: /minimap2-2.15/minimap2  -t 1 --for-only --eqx -c --cs -N 200 -k 5 -w 3 -n 1 -m 10 -s 40 -A 4 -x map-ont  /D8.tmp.thread16.barcode_with_anchor.fasta /D8.tmp.thread16.left_tail_barcode_candidate.fastq > /D8.tmp.thread16.left_tail_barcode_compare.paf 2> /dev/null
[01/06/2021 12:48:49] Return value is: 35072

fangli80 commented 3 years ago

Hi, Thanks for using AmpBinner. How many barcodes do you have in the 3M_Feb_737_Apr_Aug.txt file? I guess it's 3 million? If so, I think it's too many, which caused a memory issue.

Do you have a more precise list of barcodes? It would be great if you can narrow down this list to a few thousand. Not only for saving memory but also for better demultiplexing results. You can get a precise list of barcodes if you have the 10X genomics short-read sequencing data.

aheravi commented 3 years ago

No, unfortunately, I don't have the 10X data. That is why I am using all available barcodes to figure out the used ones.

I see some of the threads got failed (4 out of 24). I re-ran the failed commands and they got finished successfully. Not sure what is next after having the missing paf files?

fangli80 commented 3 years ago

If you don't know the barcodes, you'd better ask the person who performed the experiment. Usually, a subset of the 3 million barcodes was used in one experiment. I only tested AmpBinner when the number of barcodes is about 10,000.

By the way, are you sure the barcode upstream sequence is AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT?

If you finished AmpBinner successfully, you will see 3 output files: prefix.all_reads.txt, which contains the barcodes of all reads prefix.demultiplexing.PASS.reads.txt, which contains the barcodes that were confidently assigned. prefix.demultiplexing.statistics.txt, which is the number of reads of each barcode.

aheravi commented 3 years ago

Hi @fangli08, We don't know the exactly used barcodes. That is why I am testing all possible ones. Anyway, I successfully ran your tool on the 50,000 split lists.

WRT to the barcode upstream sequence, I think the sequence should be the one that you've mentioned in your comment. I see more reads identified with that sequence compared with an alternative sequence identified by running some code on the reads (which seems to be the second half of the sequence in your comment).

#cellular_barcode_seq   num_reads
AACACACTCAGCCTTC    1481

Using the alternative sequence, ACACTCTTTCCCTACACGACGCTCTTCCGATCT.

#cellular_barcode_seq   num_reads
AACACACTCAGCCTTC    147

Please let me know your thoughts on that. Thanks!

fangli80 commented 3 years ago

That looks correct.

aheravi commented 3 years ago

Hi @fangli08 ,

Some of the 10X barcodes I used in running "ampBinner_10X.py" were duplicates and so when I split the list, they randomly ended up in two different files. Now, I see a different number of reads for the duplicate barcodes. Any thoughts on that?

grep AACCATGAGCAGGTCA *stat*txt
xac.demultiplexing.statistics.txt:AACCATGAGCAGGTCA  113
xfu.demultiplexing.statistics.txt:AACCATGAGCAGGTCA  7

wc -l xac*spli*txt
50000 xac.10X_barcodes_splitted.txt

wc -l xfu*spli*txt
50000 xfu.10X_barcodes_splitted.txt

fangli80 commented 3 years ago

When binning the reads, AmpBinner will consider the best matched barcode and the second best matched barcode, to exclude potential misclasification due to sequencing error of the long reads.

To confidently assign a read to a barcode, the following two criteria is required: 1) the number of edite bases (mismatch + insertions + deletions) of the best matched barcode < 3 2) the number of edite bases of the second best matched barcode - the number of edite bases of the best matched barcode > 2

For example, if a barcode list has the following two barcodes:

barcode-1: AACCATGAGCAGGTCA barcode-2: ATCCATGAGCAGGTCA (only one base difference)

If a read has a sequence of AACCATGAGCAGGTCA right after the barcode_upstream_seq, then barcode-1 is the best matched barcode (num_edited_bases = 0) and barcode-2 is the second best matched barcode (num_edited_bases = 1) . It meets criterion 1 but not criterion 2, so this read is not assigned to barcode-1, because it may have barcode-2 but happen to have one sequencing error in the second base.

Therefore, all the barcodes should be supplied in one file. I thought you split the list simply to test if the tool works in your environment.

Since there are too many barcodes, which use too much resources, I think you can narrow down this list by collecting all barcodes from the prefix.all_reads.txt files of the split runs, generate a new barcode list, and then remove all duplicates in the barcode list file and run the tool with all the new barcodes in one file.

WGLab / AmpBinner

minimap error #1