NBISweden / Earth-Biogenome-Project-pilot

Assembly and Annotation workflows for analysing data in the Earth Biogenome Project pilot project.
https://www.earthbiogenome.org/
GNU General Public License v3.0

Create gethifireads.nf #8

Closed (aersoares81 closed 2 years ago)

aersoares81 commented 2 years ago

This will get the HiFi reads from the original BAM file that's delivered and convert them into a fasta file for downstream analyses. I will write a test for it now.

mahesh-panchal commented 2 years ago

Do we need fasta from the bam file? nf-core has a module for bam to fastq https://github.com/nf-core/modules/tree/master/modules/samtools/fastq which I've used in the genome properties workflow.

aersoares81 commented 2 years ago

I'm afraid that if we keep PacBio reads as fastq it might cause problems downstream, as every base will probably get flagged as "!" (low quality), since PacBio doesn't compute a phred-like score AFAIK.
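For context on the "!" character mentioned above, here is a minimal sketch of how a Phred+33 quality character in a fastq file decodes to a score. The function names `phred33_to_q` and `error_probability` are hypothetical helpers for illustration, not part of this workflow or of any tool discussed here.

```python
def phred33_to_q(char: str) -> int:
    """Decode one Phred+33 quality character to its integer Q score."""
    # Phred+33 stores Q as the ASCII code of the character minus 33.
    return ord(char) - 33


def error_probability(q: int) -> float:
    """Convert a Phred Q score to the implied per-base error probability."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    # "!" is ASCII 33, the lowest printable quality character, so it decodes to Q0.
    print(phred33_to_q("!"))
    # Q0 implies an error probability of 1, i.e. no confidence in the base call.
    print(error_probability(phred33_to_q("!")))
```

A read whose quality string is all "!" would therefore be read by quality-aware tools as having no reliable bases, which is the concern raised in the comment.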

mahesh-panchal commented 2 years ago

Which tools are using the quality score downstream?

aersoares81 commented 2 years ago

I know Inspector has a mapping step, and I believe it uses minimap2, but I don't know how minimap2 takes quality scores in fastq files into consideration. I thought it would be safer not to store information that is technically incorrect and that might confuse any program we add downstream.

mahesh-panchal commented 2 years ago

Inspector specifically runs minimap2 with the option to ignore base quality: https://github.com/ChongLab/Inspector/blob/089a740f7deaef17d7ddb7f352626fb1134d76f0/inspector.py#L94

minimap2 -Q

aersoares81 commented 2 years ago

Yes, but I don't understand why we should keep these reads in fastq format since it does not provide any benefit that I can see, and uses more disk space.
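To put a rough number on the disk space point, here is a minimal sketch comparing the uncompressed size of one read stored as fasta versus fastq. The helper names and the example read are assumptions for illustration only; real files are usually compressed, which changes the ratio.

```python
def fasta_record(name: str, seq: str) -> str:
    """Format one read as an unwrapped two-line fasta record."""
    return f">{name}\n{seq}\n"


def fastq_record(name: str, seq: str, qual: str) -> str:
    """Format one read as a four-line fastq record."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    return f"@{name}\n{seq}\n+\n{qual}\n"


if __name__ == "__main__":
    # Hypothetical 10 kb read with a placeholder quality string of "!".
    name, seq = "read1", "A" * 10_000
    qual = "!" * len(seq)
    fasta_size = len(fasta_record(name, seq).encode("ascii"))
    fastq_size = len(fastq_record(name, seq, qual).encode("ascii"))
    print(fasta_size, fastq_size, round(fastq_size / fasta_size, 2))
```

Because the quality string is as long as the sequence, an uncompressed fastq record is close to twice the size of the matching fasta record for long reads.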