GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser
https://jbrowse.org/jb2
Apache License 2.0

not important: perhaps support indexed PAF? #4577

Closed: alpapan closed this issue 1 month ago

alpapan commented 1 month ago

I've been using minimap2 to map my PacBio long reads to my genome. The output is about 6 GB uncompressed.

Loading it uncompressed gives a "File size is greater than 2 GiB" error:

RangeError: File size (6676642565) is greater than 2 GiB
RangeError [ERR_FS_FILE_TOO_LARGE]: File size (6676642565) is greater than 2 GiB
    at FSReqCallback.readFileAfterStat [as oncomplete] (node:fs:323:11)

I could convert it to another file format, or we could use the existing PAFAdapter (PIF is twice the size, and I'm not sure whether it would index the reads or the assembly; regardless, we don't need the reads).

But the PAFAdapter does not use indexes, and I haven't managed to get it to load on my desktop.

This is how I create a sorted, indexed PAF:

sort --parallel=4 -S20G -k6,6 -k8,8n file.paf | bgzip --threads 5 > file.paf.gz && tabix -C -s6 -b8 -e9 -0 file.paf.gz

But I haven't been able to visualise it: I get empty tracks (as an alignment, feature, or synteny track), whether I load the alignment against the whole genome or against just one scaffold.

I'm just going to use paftools.js splice2bed to convert it to BED (and maybe just use the SAM output of minimap2), but I'm wondering whether the above makes sense?
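For readers following the sort | bgzip | tabix command above, here is a small sketch of what its column choices mean. In PAF, column 6 is the target sequence name, column 8 the target start (0-based) and column 9 the target end, which is why the command sorts on -k6,6 -k8,8n and passes -s6 -b8 -e9 -0 to tabix. The toy records and the helper names make_toy_paf and sort_paf_by_target are made up for illustration; only the sort keys come from the thread.

```shell
# Three toy PAF records (tab-separated), deliberately out of order.
# Hypothetical data: only columns 6 (target name), 8 (target start)
# and 9 (target end) matter for the index.
make_toy_paf() {
  printf 'r1\t100\t0\t100\t+\tchr2\t5000\t300\t400\t90\t100\t60\n'
  printf 'r2\t100\t0\t100\t-\tchr1\t9000\t2000\t2100\t95\t100\t60\n'
  printf 'r3\t100\t0\t100\t+\tchr1\t9000\t50\t150\t99\t100\t60\n'
}

# Same sort keys as the command in the thread: target name first,
# then target start compared numerically. LC_ALL=C keeps the order
# independent of the local collation rules.
sort_paf_by_target() {
  LC_ALL=C sort -k6,6 -k8,8n
}

# Show read name, target name, target start and target end after sorting.
make_toy_paf | sort_paf_by_target | cut -f1,6,8,9
```

The records come out grouped by target and ordered by start position (r3, r2, r1), which is the ordering tabix needs before it can index the bgzipped file.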

cmdcolin commented 1 month ago

If you are trying to load raw reads as a track, I would definitely recommend converting to SAM, then on to BAM or CRAM, and loading the result as a plain old "AlignmentsTrack" (auto-inferred by e.g. jbrowse add-track yourfile.bam).

The PAF adapter was created as a way to visualize "assembly-to-assembly" alignments, so e.g.
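A minimal sketch of that reads-to-BAM route, assuming minimap2, samtools and the jbrowse CLI are installed and on PATH. The function name reads_to_bam, the file names, and the map-pb preset are illustrative choices, not something specified in this thread; pick the minimap2 preset that matches your read type.

```shell
# Align long reads and write a coordinate-sorted, indexed BAM.
# Arguments (all placeholders): reference FASTA, reads FASTQ, output BAM.
reads_to_bam() {
  ref=$1
  reads=$2
  out=$3
  # -a makes minimap2 emit SAM instead of PAF; -x map-pb is its preset
  # for PacBio CLR reads (map-hifi is the preset for HiFi reads).
  minimap2 -ax map-pb "$ref" "$reads" |
    samtools sort -o "$out" - &&
    samtools index "$out"
}

# Usage (not run here):
#   reads_to_bam genome.fa reads.fq reads.bam
#   jbrowse add-track reads.bam --load copy --out /path/to/jbrowse2
```

Because BAM is indexed, the browser only fetches the reads overlapping the region in view, so the 2 GiB whole-file read from the original report should not come into play.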

minimap2 genome1.fa genome2.fa rather than minimap2 genome.fa reads.fq

The PAF adapter is inferred to be used for a JBrowse "SyntenyTrack", where both genome1.fa and genome2.fa are loaded as assemblies in your config.json (and you supply multiple 'assembly names' to the add-track command for it to indicate that the 'track' belongs to both assemblies: e.g. jbrowse add-track yourfile.paf -a genome2,genome1, note the order is flipped as compared to the order in the minimap2 command).

So basically, you'd want to create a de novo assembly to use the PAFAdapter or the PairwiseIndexedPAFAdapter. The PAFAdapter is good for smallish genome-to-genome alignments, but it loads the entire PAF into memory. The PairwiseIndexedPAFAdapter can load only the portions relevant to what you are viewing, particularly when you load a "synteny track" in the linear genome view. In this example session, the lower panel loads much faster because it uses the PairwiseIndexedPAFAdapter (e.g. pif.gz), while the top panel tries to load the entire human vs mouse comparison PAF into memory, which is about 70 MB gzipped or 200+ MB in memory: https://jbrowse.org/code/jb2/v2.15.1/?config=test_data%2Fhs1_vs_mm39%2Fconfig.json&session=share-yzS0ST28zx&password=M8QHY

Note that these whole-genome alignments are challenging, and we're still working on their scalability for larger genomes, so we're definitely interested if you are running into those limits.
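As a rough sketch of the assembly-to-assembly route described above: the function name align_assemblies, the file names, and the -c and -x asm5 options are assumptions for illustration (the thread itself only shows minimap2 genome1.fa genome2.fa with default settings). The make-pif step in the usage notes assumes a JBrowse CLI version that ships that subcommand.

```shell
# Align one assembly against another and write the result as PAF.
# Arguments (all placeholders): target FASTA, query FASTA, output PAF.
align_assemblies() {
  target=$1
  query=$2
  out=$3
  # -c adds CIGAR strings to the PAF output; -x asm5 is a minimap2
  # preset for closely related assemblies (an assumption here, choose
  # the preset that matches how diverged your genomes are).
  minimap2 -cx asm5 "$target" "$query" > "$out"
}

# Usage (not run here):
#   align_assemblies genome1.fa genome2.fa genome2_vs_genome1.paf
#   # Assembly names are given in flipped order, as noted in the thread:
#   jbrowse add-track genome2_vs_genome1.paf -a genome2,genome1 --load copy --out /path/to/jbrowse2
#   # Optional, for larger alignments: build an indexed PIF next to the PAF.
#   jbrowse make-pif genome2_vs_genome1.paf
```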

cmdcolin commented 1 month ago

Here is a guide showing a minimap2 FASTQ->CRAM workflow: https://www.htslib.org/workflow/fastq.html

It is mostly focused on paired-end reads, but the same idea can be applied to long reads.

alpapan commented 1 month ago

OK, thank you!

alpapan commented 1 month ago

P.S. The whole-genome alignments/synteny views in JB2 work really well for me (genomes of 300-400 Mb). Love it and use it a lot.