amojarro / carrierseq

a sequence analysis workflow for low-input nanopore sequencing
MIT License
6 stars 1 forks source link

Updating grep command to count channels #4

Closed vsoch closed 7 years ago

vsoch commented 7 years ago

Hey @amojarro ! I am looking at the $all_reads provided (thanks again!) and it looks like a read has a header like this:

@channel_180_d3f2c4b5-0117-453f-a042-573d3c145a33_template /Users/mojarro/Documents/Sequencing/Low_Input_Sequencing/minknow_1_5_18/fast5/pass/1/VENUSAUR_20170511_FNFAE22530_MN17220_sequencing_run_sample_id_55268_ch180_read1482_strand.fast5
GGGATAGTTCAGATGGGACTGGACATGGTTTAGGAGACATAACAGGACTGGCTGCCGGCCTGCCTGTGCCAGGCGTGCCTGTGGCTGGGTTCATGGACAGCCTGGGCGTTATGTTCAGCCTGTGTGGGTTCTGGACTTTGGTTCAGGATGGGACATTAAAATTCTGAAATTTAAGACATTGCCAGGCTGTTTGTGCCTGGCTGGCACCTCCCACTTTAAATTCAGCACGCGCATTTTCCCACCCTTCCTTTGAAAAGCGTACACACCACCTTCCGCTACCATTGCATTACACTTGCACACTGCCTTCCTGGGACCCGGCGCACTGCGCGCTGCGCTGCGCGCCACACAGCACACCACACTTAACACCTCCCACCACCAGCCCAGTTCTTCCCGTGCGCCCCGCCTCCCACAAAACGCCACATACTACCCCACTACCCCACTTCCAGCGTTTGATTCGGAGAACATAACCAGCATGGGCTGAGCACACCGGCTGTGCCTGCCTCTTCCTTGCTGCCAGCACTTCCTTCTGCTTCCCTTCTCCCACCCTTCTTCCTGTCACCATGCTACCATACGCCGGCCAGGTGGCCTGGACCTTGTTCGCCCCAAAACAGACATGCCATGGTTCTTCATTGCCTTGGTGGCCCCACCGCGCCGCCTGCGCCTGCCTGGTTGCTGCGCACCCACTGCCTTCTTCCATTCTTCAGGGCACTTCCAGCAACCAGTGGCTGGGACAGCATTGGACTGGAGACTGGGTTGGATGGGTTCAGGATGGATGAGACACCAGCATGCCTGGGACATGGGCATGGGAGATGGGAGGCTGGAACAGTGGTTCTGAACACATTCTCTACTCAGG

and I am thinking we may need to update the grep line to find them?

    echo Starting Poisson calculation...
    # 06 grep - extract all channels used, delete duplicates to count unique (n/512) channels used
    echo 'Counting total channels in use...'
    grep -Eo '_ch[0-9]+_|ch=[0-9]+' $all_reads > $output_folder/06_poisson_calculation/01_reads_channels.lst
    awk '!seen[$0]++' $output_folder/06_poisson_calculation/01_reads_channels.lst > $output_folder/06_poisson_calculation/02_channels_used.lst

Specifically it would be this step. Would a more general grep for @channel work? I'm not trained in variant calling so I wanted to have a discussion vs. coming up with something on my own. How long does this step usually take?

vsoch commented 7 years ago

oh wow that was much faster! I changed it to:

grep -Eo '@channel_[0-9]+' $all_reads > $output_folder/06_poisson_calculation/01_reads_channels.lst

and now the file has a very long list that looks like this:

@channel_180
@channel_180
@channel_182
@channel_182
@channel_182
@channel_182
@channel_182
@channel_182
@channel_182
@channel_182
@channel_182
[truncated]

I'll hold off on trying out the next steps before we chat, it's likely that this command needs to be tweaked, and (I think you are on the East Coast?) and (hopefully) already sleeping. Chat later!

amojarro commented 7 years ago

Hi @vsoch, the current grep command is definitely not optimal. For this particular data set, it took my iMac (Late 2014, 4Ghz i7, 32 GB, w/ Solid State Drive) about 2 minutes.

However, from my experience, I have had the most consistent success (with various MinKnow, Albacore, and Poretools versions) by either searching for the channel in the .fast5 file path _ch[0-9]+_ when converting basecalled fast5 files to fastq or for ch=[0-9]+ if the fastq was generated from MinKnow during live basecalling or from Albacore with -o fast5,fastq enabled.

There is something similar here and here before the "good read vs. bad read" sorting step.

amojarro commented 7 years ago

@vsoch I forgot to mention that the header @header_abc_etc will not always include the channel information @channel_123_header_abc_etc depending on the software or software version.

vsoch commented 7 years ago

ah this makes sense, I didn't know the headers varied. I think the original is good then, thanks for the help.

And a heads up I'm going to do a small PR later to adjust some of the environment variables (just found a small typo for a path)