Psy-Fer / SquiggleKit

SquiggleKit: A toolkit for manipulating nanopore signal data
MIT License

Best way to predix PAF feature from gencode #15

Closed: callumparr closed this issue 5 years ago

callumparr commented 5 years ago

If I have a feature from gencode, for instance, and want to pull all fast5 files relating to the EEF1A1-201 transcript:

ENST00000309268.10|ENSG00000156508.17|OTTHUMG00000015031.6|OTTHUMT00000128718.1|EEF1A1-201|EEF1A1|2303|protein_coding|

Would I filter using something like

-x "ENST00000309268.10|ENSG00000156508.17|OTTHUMG00000015031.6|OTTHUMT00000128718.1|EEF1A1-201|EEF1A1|2303|protein_coding|"

or just use the transcript symbol?

-x EEF1A1-201

I was trying to figure it out from the examples given but wasn't 100% sure.

Psy-Fer commented 5 years ago

Hello,

If you have a paf file from minimap2, it should be as simple as doing a grep on column 6 (the "Target sequence name") for the target you want, then using the filtered paf as your input filter file with the -p, --paf flag.
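A minimal sketch of that column 6 filter, using awk rather than grep so that the pipe characters in a gencode name are compared literally and only column 6 is tested. The function name and the file names are hypothetical. It also assumes the reads were aligned against a gencode transcript FASTA, in which case the target name would be the full pipe-delimited header rather than just EEF1A1-201.

```shell
# Hypothetical helper (not part of SquiggleKit): print the PAF records
# whose target sequence name (column 6) is exactly equal to $1.
paf_select_target() {
    target="$1"   # full target name as it appears in column 6
    paf="$2"      # tab-separated PAF file, e.g. from minimap2
    # Appending "" forces a string comparison, so the name is never
    # interpreted as a number or as a regular expression.
    awk -F'\t' -v t="$target" '($6 "") == (t "")' "$paf"
}

# Example call (file names are placeholders):
# paf_select_target "ENST00000309268.10|ENSG00000156508.17|OTTHUMG00000015031.6|OTTHUMT00000128718.1|EEF1A1-201|EEF1A1|2303|protein_coding|" reads.paf > EEF1A1-201.paf
```

The resulting file is what would then be passed with -p, --paf. If column 6 in your paf holds something else (for example a chromosome name from a genome alignment), this exact match will return nothing and a different selection is needed.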

If you have a sam file and, say, a bed file with the regions you want, select the overlapping alignments with the samtools view -hL selection.bed ... command, then simply extract the readIDs into a flat file using something like grep -v ^@ filtered.sam | cut -f1 > my.flat.file.txt and pass that file to fast5_fetcher with the -f, --flat flag.
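The readID extraction above can be sketched as a small function. The function name is hypothetical, and the sort -u step is an addition not in the command above: it is there on the assumption that a read may appear in more than one alignment record (secondary or supplementary alignments) and only needs to be listed once.

```shell
# Hypothetical helper (not part of SquiggleKit): list the unique readIDs
# found in a SAM file, one per line.
sam_read_ids() {
    sam="$1"
    # Drop header lines (they start with @), keep column 1 (the read name),
    # then remove duplicates.
    grep -v '^@' "$sam" | cut -f1 | sort -u
}

# Example call (file names are placeholders; filtered.sam is assumed to be
# the output of the samtools view step):
# sam_read_ids filtered.sam > my.flat.file.txt
```

The flat file written this way has one readID per line and is what the -f, --flat flag is given.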

The -x flag is for use with the trim option for easy naming of trimmed file output.

Psy-Fer commented 5 years ago

If you would like some more specific help, let me know what files you have and are working with, and I can give you some more specific examples.

:)

callumparr commented 5 years ago

Ah OK, thank you, I understand. First we should filter the fastq or paf down, and then use that to fetch the fast5.

Psy-Fer commented 5 years ago

Yep, that is correct. I'm glad that explanation helped.