OmarOakheart / nPhase

Ploidy agnostic phasing pipeline and algorithm
GNU General Public License v3.0

nPhase needs a guide/tutorial for working on large/very heterozygous genomes #13

Open OmarOakheart opened 3 years ago

OmarOakheart commented 3 years ago

Until nPhase makes efficient use of heuristics to drastically speed up prediction, users will run into problems when running it on large genomes. They would benefit from a guide that explains how to reduce the time it takes to obtain results and how to interpret them.

HMPNK commented 1 year ago

Absolutely... Any news on how to cope with this?

Some features that should be added:

- BAM support. I am trying nPhase on a hexaploid plant (haploid genome ~650 Mbp), and nPhase inflates the data enormously. If this goes on, I will run out of disk space:

-rw-rw-r-- 1 309G Jun 22 07:04 hexa.sam
-rw-rw-r-- 1 302G Jun 22 08:23 hexa.pass.sam
-rw-rw-r-- 1 302G Jun 22 10:53 hexa.sorted.header.sam
-rw-rw-r-- 1 231G Jun 22 12:06 hexa.sorted.sam

OmarOakheart commented 1 year ago

Hi,

Those are some really large files. I imagine you have very high coverage?

Unfortunately this will require you to make some manual modifications to reduce the computational burden.

My recommendations would be to do the following:

  1. Reduce the coverage of your input files to 10X/haplotype (60X total)
  2. Run nPhase one chromosome at a time (you can do so by using a different reference FASTA for each chromosome, though there may be better ways)
  3. You can save time by using nphase partial and reusing the same VCF file each time (which will have been run on the entire genome)
  4. You can also save time by reusing the same long read SAM file if you have one with the reads fully mapped to the genome. nPhase will only look at positions in the VCF file.
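
Recommendations 1 and 2 can be prepared with a few lines of scripting. The following is a minimal Python sketch, not part of nPhase: the function names `subsample_fraction` and `split_fasta` and the example figure of 120X are hypothetical. It computes which fraction of reads to keep to reach roughly 10X per haplotype, and writes one reference FASTA per chromosome. The actual read subsampling is left to whichever tool you already use, and the sketch assumes sequence names are safe to use as file names.

```python
from pathlib import Path


def subsample_fraction(current_coverage, ploidy, per_haplotype=10.0):
    """Fraction of reads to keep so total coverage is about ploidy * per_haplotype.

    Returns 1.0 when the data are already at or below the target.
    """
    target = ploidy * per_haplotype  # e.g. 6 * 10X = 60X for a hexaploid
    return min(1.0, target / current_coverage)


def split_fasta(fasta_path, out_dir):
    """Write one FASTA file per sequence and return the written paths.

    Assumption: the first word of each header line is a valid file name.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # A new header starts a new output file.
                if handle is not None:
                    handle.close()
                fields = line[1:].split()
                name = fields[0] if fields else f"seq{len(written) + 1}"
                path = out_dir / f"{name}.fasta"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written


if __name__ == "__main__":
    # Hypothetical example: a hexaploid sample sequenced at 120X total.
    print(subsample_fraction(120, ploidy=6))
```

Each per-chromosome FASTA can then serve as the reference for a separate run, while the VCF and long-read SAM files are reused as described in points 3 and 4.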

If you'd like, you can email me at omaroakheart@gmail.com and we can set up a call to talk about your use case for nPhase. It's possible that nPhase isn't going to give you the data that you're looking for. For example, it shouldn't be expected to give you a chromosome-scale phasing. But there are things it can do well, like phasing individual genes and regions in a ploidy-agnostic way. It depends on what information you're trying to get.