Open aheravi opened 3 years ago
Hi,
Thanks for using AmpBinner.
How many barcodes do you have in the 3M_Feb_737_Apr_Aug.txt
file? I guess it's 3 million? If so, I think it's too many, which caused a memory issue.
Do you have a more precise list of barcodes? It would be great if you can narrow down this list to a few thousand. Not only for saving memory but also for better demultiplexing results. You can get a precise list of barcodes if you have the 10X genomics short-read sequencing data.
No, unfortunately, I don't have the 10X data. That is why I am using all available barcodes to figure out the used ones.
I see some of the threads got failed (4 out of 24). I re-ran the failed commands and they got finished successfully. Not sure what is next after having the missing paf files?
If you don't know the barcodes, you'd better ask the person who performed the experiment. Usually, a subset of the 3 million barcodes was used in one experiment. I only tested AmpBinner when the number of barcodes is about 10,000.
By the way, are you sure the barcode upstream sequence is AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
?
If you finished AmpBinner successfully, you will see 3 output files:
prefix.all_reads.txt
, which contains the barcodes of all reads
prefix.demultiplexing.PASS.reads.txt
, which contains the barcodes that were confidently assigned.
prefix.demultiplexing.statistics.txt
, which is the number of reads of each barcode.
Hi @fangli08, We don't know the exactly used barcodes. That is why I am testing all possible ones. Anyway, I successfully ran your tool on the 50,000 split lists.
WRT to the barcode upstream sequence, I think the sequence should be the one that you've mentioned in your comment. I see more reads identified with that sequence compared with an alternative sequence identified by running some code on the reads (which seems to be the second half of the sequence in your comment).
#cellular_barcode_seq num_reads
AACACACTCAGCCTTC 1481
Using the alternative sequence, ACACTCTTTCCCTACACGACGCTCTTCCGATCT.
#cellular_barcode_seq num_reads
AACACACTCAGCCTTC 147
Please let me know your thoughts on that. Thanks!
That looks correct.
Hi @fangli08 ,
Some of the 10X barcodes I used in running "ampBinner_10X.py" were duplicates and so when I split the list, they randomly ended up in two different files. Now, I see a different number of reads for the duplicate barcodes. Any thoughts on that?
grep AACCATGAGCAGGTCA *stat*txt
xac.demultiplexing.statistics.txt:AACCATGAGCAGGTCA 113
xfu.demultiplexing.statistics.txt:AACCATGAGCAGGTCA 7
wc -l xac*spli*txt
50000 xac.10X_barcodes_splitted.txt
wc -l xfu*spli*txt
50000 xfu.10X_barcodes_splitted.txt
When binning the reads, AmpBinner will consider the best matched barcode and the second best matched barcode, to exclude potential misclasification due to sequencing error of the long reads.
To confidently assign a read to a barcode, the following two criteria is required: 1) the number of edite bases (mismatch + insertions + deletions) of the best matched barcode < 3 2) the number of edite bases of the second best matched barcode - the number of edite bases of the best matched barcode > 2
For example, if a barcode list has the following two barcodes:
barcode-1: AACCATGAGCAGGTCA barcode-2: ATCCATGAGCAGGTCA (only one base difference)
If a read has a sequence of AACCATGAGCAGGTCA
right after the barcode_upstream_seq, then barcode-1 is the best matched barcode (num_edited_bases = 0) and barcode-2 is the second best matched barcode (num_edited_bases = 1) . It meets criterion 1 but not criterion 2, so this read is not assigned to barcode-1, because it may have barcode-2 but happen to have one sequencing error in the second base.
Therefore, all the barcodes should be supplied in one file. I thought you split the list simply to test if the tool works in your environment.
Since there are too many barcodes, which use too much resources, I think you can narrow down this list by collecting all barcodes from the prefix.all_reads.txt
files of the split runs, generate a new barcode list, and then remove all duplicates in the barcode list file and run the tool with all the new barcodes in one file.
Hi, I am trying to use ampBinner_10X.py on my fastq and getting errors on the alignment step. Could you please comment on that?
Thanks!
Running code:
Error: