Demultiplexing Dual Indexed Primers

calacademy-research / minibar

Dual barcode and primer demultiplexing for MinION sequenced reads

BSD 2-Clause "Simplified" License


Demultiplexing Dual Indexed Primers #5

Closed lroppolo closed 2 years ago

lroppolo commented 2 years ago

Hi there!

I am trying to demultiplex some ONT sequences that I barcoded with Twist's Unique Dual Index primers. I have concatenated all of my fastq reads into a single file, which passes two sanity checks (validateFastq and fastq_info) before I run Minibar. When I run Minibar with the -M 1 and -T parameters, it generates output without any visible errors, but the output does NOT pass validateFastq or fastq_info.

I am seeing the following errors:

(Fastq_info): Line 129: invalid character ' (hex. code:'1b'), expected ACGTUacgtu0123nN.

(Validate_Fastq): Exception in thread "main" htsjdk.samtools.SAMException: Sequence and quality line must be the same length at line 125 in fastq

Additionally, my demultiplexing.tsv file looks like this:

    SampleID  FwIndex     FwPrimer                           RvIndex     RvPrimer
    9901      TGTGAAGGCC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  TTGCTAAGGA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9902      CCTTGACTGC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  ACTCCTTGGC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9903      AATGCGTCGG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  GAAGGCGAAC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9904      AAGACTACAC  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  CAATACCTTG  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9905      GTCAGTGCAG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  CGACGACAAG  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9906      CTCACCAGAA  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  GAACCTGACC  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9907      TCTCGTACTT  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  TTGCCTCGCA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
    9908      TCAGATTAGG  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA  TTCGTGTCGA  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT

Do you have any suggestions where I might be getting hung up in demultiplexing with Minibar? I would appreciate any and all help you can provide me with. Many thanks!

Lauren

jbh-cas commented 2 years ago

Here are the output options from the help info:

    -S outputs sequence record in fasta or fastq format of input (default output)
    -T trims barcode and primer from each end of the sequence, then outputs record
    -C similar to S but uses upper/lower case to show found barcode indexes and primers
    -CC also colors found barcode blue, primer green if found, primer red otherwise

The default, -S, writes the records out with match info in the header.

I don't see your command line, but the error from Fastq_info says it sees a hex 1b character in the file.

It is a bit arcane, but that code is the ASCII Escape character, decimal 27. When characters are written to your terminal screen a group of characters, beginning with Escape and ending in 'm', is used to control changes to the output on the screen.

That is, it changes what the current settings might be. This is what allows text color, boldness and the like.

Of the output options above, we expect only -CC to embed ANSI Escape control sequences in the output, in this case to turn the characters blue, green or red and then back to the default.

None of the other options are intended to include such Escape sequences. If -CC is not in the mix we'd need to see the command line you are using and at least a few lines of the output file to find out where the Escape code is turning up.

If you look at your file with less -R, do you see any odd characters?
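As a complement to inspecting the file by eye, the following is a minimal Python sketch of how one might locate and remove such color sequences. It is not part of minibar; the function names, the sample record, and the choice of blue (`34`) as the color code are assumptions made for illustration.

```python
import re

# ESC (hex 1b) followed by '[', digits/semicolons, and a final 'm' is a
# color/style control sequence of the kind described above.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def find_escape_lines(lines):
    """Return 1-based line numbers of lines that contain a raw ESC character."""
    return [n for n, line in enumerate(lines, start=1) if "\x1b" in line]


def strip_sgr(text):
    """Remove color/style sequences, leaving only the plain characters."""
    return ANSI_SGR.sub("", text)


# Hypothetical record with a colored barcode, for illustration only.
sample = ["@read1", "\x1b[34mTGTGAAGGCC\x1b[0mACGT", "+", "IIIIIIIIIIIIII"]

print(find_escape_lines(sample))  # line numbers that contain ESC
print(strip_sgr(sample[1]))       # sequence line with the color codes removed
```

Stripping the codes after the fact is only a workaround; rerunning without -CC avoids writing them in the first place.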

Btw, I like to use -CC and the default stdout output and pipe it to less -R to see what the program has guessed about the primers and barcodes. Then remove the -CC and rerun minibar piping its output to a fasta file.

As the nanopore chemistry has improved, the lower error rate in these reads has allowed better determination of these technical sequences.

lroppolo commented 2 years ago

Hi Jim, thank you for your quick reply!

First, I used this command to visualize the colored output:

python3 minibar.py demultiplex.tsv newly_merged_reads_barcoding.fastq.gz -CC | less -R

Next, I ran this command to produce the output I described in my previous message, the one that is giving me errors:

python3 minibar.py demultiplex.tsv newly_merged_reads_barcoding.fastq.gz -T -CC -M 1

So perhaps this is where the problem lies: I need to remove the -CC option for my usable output. I will update once I have some results to share with you.

Thank you again for your help. I am a beginner with this tool (it is very cool!) and this is excellent feedback that I hope will push me in the right direction.

Lauren

lroppolo commented 2 years ago

Hi Jim,

An update for you:

My job finished running without any visible errors from Minibar. However, I am still getting errors when I run the fastq validation tools on the trimmed and demultiplexed file. The errors are as follows:

(Fastq_utils):

    fastq_utils 0.25.1
    DEFAULT_HASHSIZE=39000001
    Scanning and indexing all reads from /users/minibar/slurm-1801400.out
    Read name provided with no suffix
    ERROR: Error in file /users/minibar/slurm-1801400.out: line 38426: header2 wrong. The line should contain only '+' followed by a newline or read name (header1).

(Validate_Fastq):

    INFO [2022-08-08 22:47:35,901] [ValidateFastq$] - Start
    Exception in thread "main" htsjdk.samtools.SAMException: Quality header must start with +: x(-1,-1), x(-1,-1) unk at line 38421 in fastq /users/minibar/slurm-1801400.out
        at htsjdk.samtools.fastq.FastqReader.readNextRecord(FastqReader.java:121)
        at htsjdk.samtools.fastq.FastqReader.next(FastqReader.java:152)
        at htsjdk.samtools.fastq.FastqReader.next(FastqReader.java:43)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at nl.biopet.tools.validatefastq.ValidateFastq$.main(ValidateFastq.scala:52)
        at nl.biopet.tools.validatefastq.ValidateFastq.main(ValidateFastq.scala)

Do you have any thoughts on how I might be able to remedy these errors?

Thank you for your time!

Lauren

jbh-cas commented 2 years ago

I think ValidateFastq is being overly fastidious if that third line begins with a +. As long as the third line of every record begins with +, the record should be valid no matter what follows on that line.

However, you can easily change that line to a bare + using awk, which should be on your system since you are running on Linux.

As long as the other 3 lines of each record look good, this will change that 3rd line of the fastq record to +:

awk '2==(NR-1)%4{$0="+"}{print}' /users/minibar/slurm-1801400.out >/users/minibar/output.fastq
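The awk one-liner assumes that records still sit on clean four-line boundaries. A minimal Python sketch of a structural check that could be run first is shown below; the function name and the sample records are hypothetical, and the checks are a simplification of what validators such as ValidateFastq do.

```python
def check_fastq(lines):
    """Return (line_number, problem) pairs for 4-line FASTQ records."""
    problems = []
    lines = [line.rstrip("\n") for line in lines]
    if len(lines) % 4 != 0:
        problems.append((len(lines), "line count is not a multiple of 4"))
    # Walk complete records only; a trailing partial record is reported above.
    for i in range(0, len(lines) - len(lines) % 4, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            problems.append((i + 1, "header does not start with @"))
        if not plus.startswith("+"):
            problems.append((i + 3, "third line does not start with +"))
        if len(seq) != len(qual):
            problems.append((i + 2, "sequence and quality lengths differ"))
    return problems


# Hypothetical records for illustration only.
good = ["@r1", "ACGT", "+", "IIII"]
bad = ["@r1", "ACGT", "x(-1,-1), x(-1,-1) unk", "IIII"]

print(check_fastq(good))
print(check_fastq(bad))
```

If the check reports a third line that does not start with +, the file likely has extra or shifted lines, and rewriting every third line would hide that problem instead of fixing it.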

By the way, x(-1,-1), x(-1,-1) unk indicates that no primer or barcode was found for that particular read.

After R and Python, awk is probably the most useful programming language for bioinformatics. Heng Li's bioawk extension of awk is great for handling fastq and multi-line fasta files, and there is a version on the Cal Academy GitHub that builds on Heng Li's bioawk with several other nice additions. You don't need either to make this change, though.

lroppolo commented 2 years ago

Hi Jim!

Checking in: my colleague and I have looked over two sample files, and we've determined that Minibar appears to be shifting the lines of output. I hope this helps; here are the 2 files:

preMinibar.txt postMinibar.txt

Reviewing the above files, it appears that Minibar is shifting the lines beginning at 38421 by inserting the string 'demultiplex.tsv newly_merged_reads_barcoding.fastq.gz Index edit dist 3, Primer edit dist 12, Search Len 80, Search Method 1, Output Type T$'. It has also inserted the string '10000^M x(-1,-1), x(-1,-1) unk$' into line 38422, which shifts the rest down one line. I hope that helps! Please let me know if this is something you want to look into further, and if not, I appreciate all of your help to this point!

Best,

Lauren

jbh-cas commented 2 years ago

Lauren,

The files look like they are in rtf format instead of plaintext so I am not viewing them well. But I can see that of the few records shown, none of them had barcodes found: x(-1,-1), x(-1,-1) unk at end of header line.

The text you are describing, demultiplex.tsv newly_merged_reads_barcoding.fastq.gz Index edit dist 3, Primer edit dist 12, Search Len 80, Search Method 1, Output Type T, is written to stderr, not stdout.

It would appear that whatever is capturing the text is capturing both stdout and stderr. I saw you are using Slurm, but minibar is not doing much more work than a cat or grep of the file, so you should be able to run it directly from the command line.

Look at the Example in the ReadMe file. I just ran it locally as below, and here is what it looks like on the screen:

$ minibar.py IndexCombinationPeperomonia.txt PeperomiaTestSet.fasta >PeperomiaTestSet_SampleIDs.fa
IndexCombinationPeperomonia.txt PeperomiaTestSet.fasta : Index edit dist 4, Primer edit dist 11, Search Len 80, Search Method 3, Output Type S
750 seqs: H 750 HH 679 Hh 62 hh 0 IDs 741 Mult_IDs 0 (0.1109s)

Note that the output is written to a file via >PeperomiaTestSet_SampleIDs.fa and the output similar to what you are reporting is still written to the screen. That is because what is captured in PeperomiaTestSet_SampleIDs.fa was written to stdout and the text on the screen was written to stderr.
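The difference between the two streams can be reproduced with a small Python sketch. The child process here is a stand-in that prints one record line to stdout and one status line to stderr; it is not minibar, and the strings it prints are made up for illustration.

```python
import subprocess
import sys

# Stand-in child process (not minibar): one record line goes to stdout,
# one status line goes to stderr.
child = [
    sys.executable, "-c",
    "import sys; print('@read1'); print('status: 1 seqs', file=sys.stderr)",
]

# Streams captured separately: only the record line ends up in stdout.
separate = subprocess.run(child, capture_output=True, text=True)

# stderr redirected into stdout, as when a single job log collects both
# streams: the status line now sits among the records.
merged = subprocess.run(
    child, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)

print(repr(separate.stdout))
print(repr(separate.stderr))
print(sorted(merged.stdout.splitlines()))
```

In the merged case the status text lands in the same place as the sequence records, which is the kind of line shifting a fastq validator then reports.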

Maybe see if you can get this example working for you -- files in test data directory -- and then see if you can move to your intended fastq input.

lroppolo commented 2 years ago

Hello Jim!

I wanted to tell you that I tried your suggestion and successfully got my fastq file to pass all of the necessary tests. I feel so silly for asking when the solution was so simple, but I appreciate your help so much!

Thank you and I look forward to incorporating your tool into my future work!

Best,

Lauren