Open lroppolo opened 2 years ago
That does seem excessive. Let's look at some of the headers for the output files by doing this:
head -n 1 *fastq | head -n 30
That will show the file names and first lines for 10 of the fastq files to give a sense of what is being called. You can just paste that into a response. Also, if you could share the barcode file as an attachment, that would be helpful.
My first guess is that something in the barcode file causes each record to be seen as a sample, though I don't know what that might be. That's why taking a look at the barcode file and the headers will be helpful.
---------------------------- Original Message ----------------------------
Subject: [calacademy-research/minibar] -M 2 Question (Issue #6)
From: "LR Brazell" @.***>
Date: Wed, September 14, 2022 9:14 am
To: "calacademy-research/minibar" @.***>
Cc: "Subscribed" @.***>
Hello there!
I have a question about the -M 2 option for identifying sample types using the barcodes.
I see that option 2 finds matched barcodes on both ends of sequence, and identifies pairs that match a sample ID. I am using a dual-indexed barcode set on 16 samples, and when I select this option for demultiplexing, I end up with nearly 18,000 individual fastq files as output instead of 16 individual bins for my samples. Am I doing something wrong here? I just need some guidance on how this works so I can be sure I'm doing the right thing.
Thanks!
Lauren
--
Reply to this email directly or view it on GitHub:
https://github.com/calacademy-research/minibar/issues/6
You are receiving this because you are subscribed to this thread.
Message ID: @.***>
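For anyone who would rather run the header check from Python, the `head -n 1 *fastq | head -n 30` suggestion can be approximated with the sketch below. It is not part of minibar: the function name `first_headers` and its parameters are made up for illustration, and it assumes uncompressed fastq files in the current directory.

```python
from itertools import islice
from pathlib import Path


def first_headers(directory=".", pattern="*fastq", limit=10):
    """Return (file name, first line) pairs for up to `limit` matching files."""
    files = (p for p in sorted(Path(directory).glob(pattern)) if p.is_file())
    pairs = []
    for path in islice(files, limit):
        # errors="replace" keeps the sketch from crashing on odd bytes
        with path.open(errors="replace") as handle:
            pairs.append((path.name, handle.readline().rstrip("\n")))
    return pairs


# Print each file name followed by its first (header) line.
for name, header in first_headers():
    print(f"==> {name} <==")
    print(header)
```

Because it reads only one line per file, it also works on output files that hold a single record.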
Hi Jim,
Thank you very much for your reply!
I'm going to attach the barcode file below: longreads_demux.csv
For reference, I am using the Twist UDI primer sets, and I will also attach a screen grab of the reference sheet below so that you can get an idea of where I am getting the pieces of the barcode file from:
I am getting errors when I try to run the command you provided to look at the headers of the fastq files; evidently the files don't have at least 10 lines. On seeing this error, I chose to look at my smallest output file in its entirety, and have pasted it below:
@8c581613-6b2c-4a30-b990-7660417f51d7 runid=534c0b0e13044fa9b3e67426197b926f4230572b read=3690 ch=998 start_time=2022-08-18T19:55:25.874796+00:00 flow_cell_id=PAM31864 protocol_group_id=CN_longReads sample_id=no_sample parent_read_id=8c581613-6b2c-4a30-b990-7660417f51d7 basecall_model_version_id=2021-05-05_dna_r9.4.1_promethion_384_dd219f32 h+(3),h-(3) 97230 97232 97123 97131 97042 97049 97231
ACGCTTCGATCTCTCTCTCTTTCCTTCTCTCTGTCTCTCTCTGCCTGTCTCTCTCACTCTGTCTTCTGTCTTACACTCTCTCTCTGCCTGCCTGTCTCTCTCACTCTCTCTCTCTGTGTGTCTCTCTCTCTCTTTCTGTTTCTCTCTGTCTCTCTCTGTCTGTCTCTGTCTTTCTCTGTCTGTCTCTTTGTCTGTCTGTCTTTGTCTTTCCTTCT
+
04*'%')&&*+/{{{{{{{850/('(+.1{{<4447{{{{{420103887{2{{3/..362445=<)((((511+*+.>{{{{{{4320/4777879{{{{6,,,/{{{{{{{{@68={{{;;?{{{{{{{{{322210003+*{{{5455{{{{{7121,*''++)*+++//02{{44599889=998889<=<=:9:=>CA@?>BACC:89>:
Please let me know if I can provide you with anything else, and again I very much appreciate your willingness to help me work through this!
-Lauren
This is a csv, which means fields are separated by commas. We need a tsv, where tabs are the separator character. This is the message I get running the command on your one-record example, which I screen-scraped from your comment:
$ minibar.py longreads_demux.csv email_rec.fq -D
Need at least 5 tab delimited columns in the barcode_file. Here is the first line of 'longreads_demux.csv':
SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer
I am going to change the commas to tabs, and I am also going to add Samp_ in front of the number at the beginning of each line so that the sample name stands out more at the end of the header. But I am thinking this is not exactly the file set you are using. I'll send the tsv version back to you and you can let me know what it does. If it does the same thing, perhaps you can send a handful of the input records to test against.
---------------------------- Original Message ----------------------------
Subject: Re: [calacademy-research/minibar] -M 2 Question (Issue #6)
From: "LR Brazell" @.***>
Date: Sat, September 17, 2022 5:12 pm
To: "calacademy-research/minibar" @.***>
Cc: "Jim Henderson" @.***>
"Comment" @.***>
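The comma-to-tab conversion described above can be done in an editor, but a minimal Python sketch is shown here for reference. The function name `csv_to_tsv` and the `prefix` parameter are hypothetical, and the example row assumes the original sample IDs are bare numbers, as the comment implies.

```python
import csv
import io


def csv_to_tsv(csv_text, prefix="Samp_"):
    """Convert comma-separated barcode rows to tab-separated text.

    The header row is copied unchanged; every later row gets `prefix`
    added to its first field (the sample ID).
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    header_seen = False
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        if header_seen:
            row[0] = prefix + row[0]
        header_seen = True
        writer.writerow(row)
    return out.getvalue()


# Header plus one row, using the column names quoted in the thread.
example = (
    "SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer\n"
    "1,CACTCAAGAA,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,"
    "TGGATGGCAA,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT\n"
)
print(csv_to_tsv(example), end="")
```

To convert a real file, read its text, pass it through the function, and write the result to a new `.tsv` file rather than overwriting the original.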
Try the embedded longreads_demux.tsv file for your barcode input and see how that works.
Also, if you could, run wc -l on your original barcode file to count the number of lines and see what that tells us.
Here it is for the file longreads_demux.tsv, showing it has 17 lines:
$ wc -l longreads_demux.tsv
17 longreads_demux.tsv
longreads_demux.tsv
SampleID FwIndex FwPrimer RvIndex RvPrimer
Samp_1 CACTCAAGAA AGATCGGAAGAGCACACGTCTGAACTCCAGTCA TGGATGGCAA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_2 AGAGCCATTC AGATCGGAAGAGCACACGTCTGAACTCCAGTCA TTCACCAGCT AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_3 CACGATTCCG AGATCGGAAGAGCACACGTCTGAACTCCAGTCA CCTGAGTAGC AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_4 TTGGAGCCTG AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGGTGTCCGT AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_5 TTACGACTTG AGATCGGAAGAGCACACGTCTGAACTCCAGTCA GTCTGGTTGC AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_6 TTAAGGTCGG AGATCGGAAGAGCACACGTCTGAACTCCAGTCA CTCTTAGATG AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_7 GGTTCTGTCA AGATCGGAAGAGCACACGTCTGAACTCCAGTCA TATCACCTGC AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_8 GATACGCACC AGATCGGAAGAGCACACGTCTGAACTCCAGTCA CAGAGGCAAG AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_9 TCGCGAAGCT AGATCGGAAGAGCACACGTCTGAACTCCAGTCA CCGGTCAACA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_10 GTTAAGACGG AGATCGGAAGAGCACACGTCTGAACTCCAGTCA TCACGAGGTG AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_11 CCGGTCATAC AGATCGGAAGAGCACACGTCTGAACTCCAGTCA CCATAGACAA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_12 GTCAGCTTAA AGATCGGAAGAGCACACGTCTGAACTCCAGTCA GAGCTTGGAC AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_13 ACCGCGGATA AGATCGGAAGAGCACACGTCTGAACTCCAGTCA TACGGTGTTG AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_14 GTTGCATCAA AGATCGGAAGAGCACACGTCTGAACTCCAGTCA TTCAACTCGA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_15 TGTGCACCAA AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AAGGCAGGTA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
Samp_16 ATCTGTGGTC AGATCGGAAGAGCACACGTCTGAACTCCAGTCA CGGCCAATTC AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
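A quick way to confirm that a barcode file really is tab-delimited is sketched below. It mirrors the "Need at least 5 tab delimited columns" message quoted earlier in the thread, but it is not minibar's own code; `check_barcode_text` is a hypothetical helper name.

```python
def check_barcode_text(text, min_columns=5):
    """Return (line_count, problems) for tab-delimited barcode text.

    `problems` lists (line number, column count) for every line that has
    fewer than `min_columns` tab-separated fields.
    """
    lines = text.splitlines()
    problems = []
    for number, line in enumerate(lines, start=1):
        columns = line.split("\t")
        if len(columns) < min_columns:
            problems.append((number, len(columns)))
    return len(lines), problems


good = "SampleID\tFwIndex\tFwPrimer\tRvIndex\tRvPrimer\n"
bad = "SampleID,FwIndex,FwPrimer,RvIndex,RvPrimer\n"
print(check_barcode_text(good))  # one line, no problems
print(check_barcode_text(bad))   # one line, seen as a single column
```

Note that `wc -l` counts newline characters, so a file whose last line has no trailing newline reports one line fewer than this function does.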
Hello there Jim!
Thank you so much for your reply. I have been using a .tsv file; for whatever reason the file converted to a .csv when I pulled it down, but I was able to check and confirm that my file does have 17 lines as well.
When I run the script with the .tsv file again, the same issue happens. I will attach my script below:
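# Run minibar on every file in the fastq_pass directory.
# Globbing directly in the loop and quoting "$file" keeps paths with spaces intact.
for file in /myDirectory/fastq_pass/*
do
  echo "processing $file file..."
  python3 /mydirectory/minibar.py longreads_demux.tsv "$file" -T -M 2 -F -P CN_
done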
Essentially, I am getting output files that begin with the "CN_" prefix, and have every combination possible of Samp_1 through Samp_16. The filenames look something like this:
CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq
CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq
CN_Samp_3_Samp_8.fastq
And the output follows the format that I pasted in my last comment, if that is helpful. Please let me know if this helps make any sense of what I'm doing and we can go from there! I appreciate your willingness to help me troubleshoot.
All the best,
Lauren
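To get a feel for how many of the output files name more than one sample, the file names can be tallied by the number of sample IDs they contain. This is only a sketch: the helper names are hypothetical, and it assumes every sample ID has the form `Samp_<number>` as in the file names quoted above.

```python
import re
from collections import Counter

SAMPLE_RE = re.compile(r"Samp_\d+")


def samples_in_name(filename):
    """Return the sample IDs that appear in an output file name, in order."""
    return SAMPLE_RE.findall(filename)


def tally_by_sample_count(filenames):
    """Count how many file names mention 1, 2, 3, ... sample IDs."""
    return Counter(len(samples_in_name(name)) for name in filenames)


# The three example file names quoted in the thread.
names = [
    "CN_Samp_1_Samp_15_Samp_14_Samp_7_Samp_3.fastq",
    "CN_Samp_2_Samp_16_Samp_4_Samp_9.fastq",
    "CN_Samp_3_Samp_8.fastq",
]
for count, n_files in sorted(tally_by_sample_count(names).items()):
    print(f"{n_files} file(s) name {count} sample(s)")
```

On a real run, `names` could be filled from the output directory listing, for example with `os.listdir`.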
Lauren,
The Twist barcodes are 10bp, which is 2 more than standard Illumina barcodes but quite a bit smaller than typical Nanopore barcodes. The ones used in the minibar paper were 15bp each. I also checked with a colleague who recently used the Nanopore 96-plex kit for 40+ photobacteria species, and she pointed out this quote: "An ONT native barcode is 40 bp in length (24 bp for the barcode itself plus 8 bp of flanking sequence on each side)". Even so, she mentioned there were a lot of unclassified reads.
You can try -e 2 to reduce the error tolerance for the 10bp Twist barcodes, but I do worry that with the R9 chemistry the error rate might be too great to separate out a large number of the reads. I'm hopeful that R10 chemistry improves things, but I don't have any experience with it -- that is, R10, plenty with hope :)
best, Jim H.
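One way to judge how much error tolerance short barcodes can take is to look at how far apart they are from each other. The sketch below computes pairwise Levenshtein distances for the 16 forward indexes in the barcode file. It is an illustration, not minibar's matching code, and the function names are hypothetical. By the triangle inequality, if the smallest pairwise distance is greater than twice the allowed number of errors, no sequence can sit within that many edits of two different barcodes; how minibar actually scores a barcode inside a longer read may differ.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # match or substitution
            ))
        previous = current
    return previous[-1]


def closest_pairs(barcodes):
    """Return (minimum distance, list of barcode pairs at that distance)."""
    scored = [(edit_distance(a, b), a, b) for a, b in combinations(barcodes, 2)]
    best = min(d for d, _, _ in scored)
    return best, [(a, b) for d, a, b in scored if d == best]


# Forward indexes copied from the FwIndex column of longreads_demux.tsv.
FW_INDEXES = [
    "CACTCAAGAA", "AGAGCCATTC", "CACGATTCCG", "TTGGAGCCTG",
    "TTACGACTTG", "TTAAGGTCGG", "GGTTCTGTCA", "GATACGCACC",
    "TCGCGAAGCT", "GTTAAGACGG", "CCGGTCATAC", "GTCAGCTTAA",
    "ACCGCGGATA", "GTTGCATCAA", "TGTGCACCAA", "ATCTGTGGTC",
]
best, pairs = closest_pairs(FW_INDEXES)
print(f"smallest pairwise edit distance: {best} ({len(pairs)} pair(s))")
```

The same check can be run on the RvIndex column to see whether the reverse indexes are any closer together.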
Lauren,
I just put a new version 0.23 up on github that should reduce the number of files that you are getting. Though the sequences will go into a Multiple_Matches.fastq file, so the -e test is still worth trying.
best, Jim Henderson
Hello Jim!
I am about to download the latest version and try this. Will keep you posted on how things are going.
Thank you for your help!
Lauren
On Wed, Oct 19, 2022 at 2:05 PM Jim Henderson @.***> wrote:
Lauren,
I just put a new version 0.23 up on github that should reduce the number of files that you are getting. Though the sequences will go into a Multiple_Matches.fastq file, so the -e test is still worth trying.
best, Jim Henderson