-M 2 Question - Githubissues

calacademy-research / minibar

Dual barcode and primer demultiplexing for MinION sequenced reads

BSD 2-Clause "Simplified" License


-M 2 Question #6

Open lroppolo opened 2 years ago

lroppolo commented 2 years ago

Hello there!

I have a question about the -M 2 option for identifying sample types using the barcodes.

I see that option 2 finds matched barcodes on both ends of sequence, and identifies pairs that match a sample ID. I am using a dual-indexed barcode set on 16 samples, and when I select this option for demultiplexing, I end up with nearly 18,000 individual fastq files as output instead of 16 individual bins for my samples. Am I doing something wrong here? I just need some guidance on how this works so I can be sure I'm doing the right thing.

Thanks!

Lauren

jbh-cas commented 2 years ago

That does seem excessive. Let's look at some of the headers for the output files by doing this:

head -n 1 *fastq | head -n 30

That will show the file names and first lines for 10 of the fastq files to give a sense of what is being called. You can just paste that into a response. Also, if you could share the barcode file as an attachment, that would be helpful.

My first guess is that there is something in the barcode file that causes each record to be seen as a sample, though I don't know what that might be. That's why taking a look at the barcode file and the headers will be helpful.
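For anyone who would rather do the same header check from Python, here is a minimal sketch. The function name `first_headers` and the `*.fastq` pattern are assumptions made for illustration, not anything provided by minibar; it simply reads the first line of up to 10 FASTQ files in a directory.

```python
from itertools import islice
from pathlib import Path


def first_headers(directory, limit=10):
    """Return (file name, first line) for up to `limit` *.fastq files."""
    results = []
    # Sort the matches so the selection is stable from run to run.
    for path in islice(sorted(Path(directory).glob("*.fastq")), limit):
        with path.open() as handle:
            results.append((path.name, handle.readline().rstrip("\n")))
    return results


if __name__ == "__main__":
    # Mimic the layout of `head -n 1` when it is given several files.
    for name, header in first_headers("."):
        print(f"==> {name} <==")
        print(header)
```

Because only one line per file is read, this also works on output files that hold a single record.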

lroppolo commented 2 years ago

Hi Jim,

Thank you very much for your reply!

I'm going to attach the barcode file below: longreads_demux.csv

For reference, I am using the Twist UDI primer sets, and I will also attach a screen grab of the reference sheet below so that you can get an idea of where I am getting the pieces of the barcode file from:

I am getting errors when I try to run the command you provided to look at the headers of the fastq files- evidently the files don't have at least 10 lines. On seeing this error, I chose to look at my smallest output file in its entirety, and have pasted it below:

@8c581613-6b2c-4a30-b990-7660417f51d7 runid=534c0b0e13044fa9b3e67426197b926f4230572b read=3690 ch=998 start_time=2022-08-18T19:55:25.874796+00:00 flow_cell_id=PAM31864 protocol_group_id=CN_longReads sample_id=no_sample parent_read_id=8c581613-6b2c-4a30-b990-7660417f51d7 basecall_model_version_id=2021-05-05_dna_r9.4.1_promethion_384_dd219f32 h+(3),h-(3) 97230 97232 97123 97131 97042 97049 97231
ACGCTTCGATCTCTCTCTCTTTCCTTCTCTCTGTCTCTCTCTGCCTGTCTCTCTCACTCTGTCTTCTGTCTTACACTCTCTCTCTGCCTGCCTGTCTCTCTCACTCTCTCTCTCTGTGTGTCTCTCTCTCTCTTTCTGTTTCTCTCTGTCTCTCTCTGTCTGTCTCTGTCTTTCTCTGTCTGTCTCTTTGTCTGTCTGTCTTTGTCTTTCCTTCT
+
04*'%')&&*+/{{{{{{{850/('(+.1{{<4447{{{{{420103887{2{{3/..362445=<)((((511+*+.>{{{{{{4320/4777879{{{{6,,,/{{{{{{{{@68={{{;;?{{{{{{{{{322210003+*{{{5455{{{{{7121,*''++)*+++//02{{44599889=998889<=<=:9:=>CA@?>BACC:89>:

Please let me know if I can provide you with anything else, and again I very much appreciate your willingness to help me work through this!

-Lauren

jbh-cas commented 2 years ago

This is a csv, which means the fields are separated by commas. We need a tsv, where tabs are the separator character. This is the message I get when I run the command on your one-record example, which I screen-scraped from your comment:

$ minibar.py longreads_demux.csv email_rec.fq -D
Need at least 5 tab delimited columns in the barcode_file. Here is the first line of 'longreads_demux.csv':

SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer

I am going to change the commas to tabs, and I am also going to add Samp_ in front of the number at the beginning of each line so that the sample name stands out more at the end of the header. But I am thinking this is not exactly the file set you are using. I'll send the tsv version back to you and you can let me know what it does. If it does the same thing, perhaps you can send a handful of the input records to test against.
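The comma-to-tab conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `csv_to_tsv` is hypothetical, the first row is assumed to be the header, and the first column is assumed to hold the bare sample number.

```python
import csv


def csv_to_tsv(csv_path, tsv_path, prefix="Samp_"):
    """Rewrite a comma-separated barcode file as tab-separated.

    The first row is treated as the header and copied unchanged; on every
    later row the sample ID (first column) gets `prefix` prepended.
    """
    with open(csv_path, newline="") as src, open(tsv_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        for row_number, row in enumerate(csv.reader(src)):
            if not row:
                continue  # skip blank lines
            if row_number > 0:
                row[0] = prefix + row[0]
            writer.writerow(row)


# Example call (file names taken from this thread):
# csv_to_tsv("longreads_demux.csv", "longreads_demux.tsv")
```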

jbh-cas commented 2 years ago

Try the embedded longreads_demux.tsv file for your barcode input and see how that works.

Also, if you could, on your original barcode file, do a wc -l command to count the number of lines and see what that tells us.

Here it is for file longreads_demux.tsv showing it has 17 lines:

$ wc -l longreads_demux.tsv
17 longreads_demux.tsv

longreads_demux.tsv

SampleID    FwIndex FwPrimer    RvIndex RvPrimer
Samp_1  CACTCAAGAA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   TGGATGGCAA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_2  AGAGCCATTC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   TTCACCAGCT  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_3  CACGATTCCG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   CCTGAGTAGC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_4  TTGGAGCCTG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   AGGTGTCCGT  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_5  TTACGACTTG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   GTCTGGTTGC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_6  TTAAGGTCGG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   CTCTTAGATG  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_7  GGTTCTGTCA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   TATCACCTGC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_8  GATACGCACC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   CAGAGGCAAG  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_9  TCGCGAAGCT  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   CCGGTCAACA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_10 GTTAAGACGG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   TCACGAGGTG  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_11 CCGGTCATAC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   CCATAGACAA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_12 GTCAGCTTAA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   GAGCTTGGAC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_13 ACCGCGGATA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   TACGGTGTTG  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_14 GTTGCATCAA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   TTCAACTCGA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_15 TGTGCACCAA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   AAGGCAGGTA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_16 ATCTGTGGTC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA   CGGCCAATTC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
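
A quick way to confirm that a barcode file has the expected shape is to count its lines and check that every line has at least 5 tab-delimited columns, which is the requirement quoted in the error message earlier in this thread. The sketch below is an assumption-laden illustration: `check_barcode_file` is a hypothetical helper, not part of minibar. Note that it counts a final line even when it lacks a trailing newline, whereas `wc -l` counts newline characters.

```python
def check_barcode_file(path, min_columns=5):
    """Return (line_count, problems) for a tab-delimited barcode file.

    `problems` lists (line_number, column_count) for every line that has
    fewer than `min_columns` tab-separated fields.
    """
    problems = []
    line_count = 0
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line_count += 1
            fields = line.rstrip("\n").split("\t")
            if len(fields) < min_columns:
                problems.append((line_number, len(fields)))
    return line_count, problems


# Example call (file name taken from this thread):
# print(check_barcode_file("longreads_demux.tsv"))
```

A comma-separated file shows up immediately here, because each of its lines splits into a single field.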

lroppolo commented 2 years ago

Hello there Jim!

Thank you so much for your reply. I have been using a .tsv file; for whatever reason the file was converted to a .csv when I pulled it down, but I was able to check and confirm that my file does have 17 lines as well.

When I run the script with the .tsv file again, I see the same issue. I will attach my script below:

FILES="/myDirectory/fastq_pass/*"

for file in $FILES
do
     echo "processing $file file..."
     python3 /mydirectory/minibar.py longreads_demux.tsv "$file" -T -M 2 -F -P CN_
done

Essentially, I am getting output files that begin with the "CN_" prefix, and have every combination possible of Samp_1 through Samp_16. The filenames look something like this:

CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq
CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq
CN_Samp_3_Samp_8.fastq

And the output follows the format that I pasted in my last comment, if that is helpful. Please let me know if this helps make any sense of what I'm doing and we can go from there! I appreciate your willingness to help me troubleshoot.
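
To get a sense of how many output files carry more than one sample label, the file names can be tallied by the number of `Samp_N` labels they contain. This is a rough sketch: the helper names are hypothetical, and the pattern assumes the `Samp_` naming used in the barcode file from this thread.

```python
import re
from collections import Counter

# Matches labels such as Samp_1 or Samp_16 inside an output file name.
SAMPLE_LABEL = re.compile(r"Samp_\d+")


def labels_in_name(file_name):
    """Return the sample labels found in one output file name, in order."""
    return SAMPLE_LABEL.findall(file_name)


def summarize(file_names):
    """Count output files by the number of sample labels in their names."""
    return Counter(len(labels_in_name(name)) for name in file_names)


example_names = [
    "CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq",
    "CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq",
    "CN_Samp_3_Samp_8.fastq",
]
print(summarize(example_names))
```

On a real run, the list of names could come from `os.listdir()` on the output directory; files with exactly one label would be the cleanly assigned ones.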

All the best,

Lauren

jbh-cas commented 2 years ago

Lauren,

The Twist barcodes are 10bp, which is 2 more than standard Illumina barcodes but quite a bit shorter than typical Nanopore barcodes. The ones used in the minibar paper were 15bp each. I also checked with a colleague who recently used the Nanopore 96-plex kit for 40+ photobacteria species, and she pointed out this quote: "An ONT native barcode is 40 bp in length (24 bp for the barcode itself plus 8 bp of flanking sequence on each side)". Even so, she mentioned there were a lot of unclassified reads.

You can try -e 2 to reduce the error tolerance for the 10bp Twist barcodes but I do worry with the R9 chemistry that the error rate might be too great to separate out a large number of the reads. I'm hopeful that R10 chemistry improves things but I don't have any experience with it -- that is, R10, plenty with hope :)
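
One way to see why short barcodes leave little room for errors is to measure how far apart the barcodes are from each other. The sketch below computes the Levenshtein (edit) distance between every pair and reports the closest pair. It is a simplification and not how minibar itself searches reads: it compares whole barcodes only, and the function names are hypothetical. The general point is that a sequence can sit within e edits of two different barcodes only if those barcodes are at most 2e edits apart.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest_pair(barcodes):
    """Return (distance, barcode_a, barcode_b) for the most similar pair."""
    return min((edit_distance(a, b), a, b) for a, b in combinations(barcodes, 2))


# Forward indexes of the first four samples in longreads_demux.tsv.
forward_indexes = ["CACTCAAGAA", "AGAGCCATTC", "CACGATTCCG", "TTGGAGCCTG"]
distance, first, second = closest_pair(forward_indexes)
print(f"closest pair: {first} / {second}, edit distance {distance}")
```

Running this over all 16 forward and reverse indexes would show the smallest distance in the set, which gives a rough upper bound on a safe error tolerance.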

best, Jim H.

jbh-cas commented 2 years ago

Try the attached longreads_demux.tsv file for your barcode input and see how that works.

Also, if you could, on your original barcode file, do a wc -l command to count the number of lines and see what that tells us.

Here it is for the attached file longreads_demux.tsv showing it has 16 lines:

$ wc -l longreads_demux.tsv
16 longreads_demux.tsv

jbh-cas commented 2 years ago

Lauren,

I just put a new version 0.23 up on github that should reduce the number of files that you are getting. The sequences will go into a Multiple_Matches.fastq file, though, so the -e test is still worth trying.

best, Jim Henderson

lroppolo commented 2 years ago

Hello Jim!

I am about to download the latest version and try this. Will keep you posted on how things are going.

Thank you for your help!

Lauren
