DaehwanKimLab / hisat2

Graph-based alignment (Hierarchical Graph FM index)
GNU General Public License v3.0

hisat-3n conda installation #374

Open isaacvock opened 2 years ago

isaacvock commented 2 years ago

Quick question/suggestion: is there a way to install hisat-3n via conda, or are there plans to make a hisat-3n conda recipe? It would be very helpful for integrating hisat-3n into Snakemake alignment workflows.

paulinarosales commented 3 months ago

Are there any options or updates on this issue? At the moment I'm finding it very hard to install hisat-3n on an HPC system, and I'm not able to use it in Snakemake workflows.

isaacvock commented 3 months ago

There hasn't been an update to the hisat-3n branch (or any branch for that matter) in about 2 years, so the original developers/maintainers may have graduated or moved on.

I have used hisat-3n to align TimeLapse-seq/SLAM-seq data, but can say from experience that the benefit of using it is somewhat minor. STAR does a pretty good job of accurately recovering high-mutation-content reads on simulated data, even at simulated rates of s4U incorporation of around 10%. "Accurate recovery of high-mutation-content reads" is judged by the distribution of mutation rates in reads, and by the extent to which: 1) the R package I developed (bakR) provides an estimate for the mutation rate in new reads that is close to the true simulated value, and 2) the estimated fraction of reads that are new is consistently close to the true simulated fraction new. I think STAR was originally designed to be particularly robust to genomic mutations, and thus generally does not penalize mismatches as heavily as other alignment inconsistencies, which is probably what helps it in this setting. Therefore, I have just defaulted to using STAR in my pipelines (bam2bakR and fastq2EZbakR, the latter still under development).
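
To make "distribution of mutation rates in reads" concrete, here is a minimal Python sketch of a per-read T-to-C rate. It is only an illustration, not how bakR or any aligner computes things: the function name `tc_rate` and the toy sequences are made up, and it assumes gapless, forward-strand alignments where the read and reference strings have already been paired up base by base.

```python
from typing import List, Optional, Tuple


def tc_rate(ref_aln: str, read_aln: str) -> Optional[float]:
    """Fraction of reference T positions that the read reports as C.

    Assumes a gapless, forward-strand alignment (hypothetical simplification).
    Returns None when the reference segment contains no T at all.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    t_positions = 0
    t_to_c = 0
    for ref_base, read_base in zip(ref_aln.upper(), read_aln.upper()):
        if ref_base == "T":
            t_positions += 1
            if read_base == "C":
                t_to_c += 1
    if t_positions == 0:
        return None
    return t_to_c / t_positions


# Toy (reference, read) pairs; real input would come from parsed alignments.
pairs: List[Tuple[str, str]] = [
    ("ATTGCTTACGT", "ATTGCTTACGT"),  # no conversions: looks like an "old" read
    ("ATTGCTTACGT", "ACTGCCTACGC"),  # 3 of 5 T's converted: looks like a "new" read
    ("ACGA", "ACGA"),                # no T in the reference segment
]

# Per-read rates, skipping reads that cover no reference T.
rates = [r for r in (tc_rate(ref, read) for ref, read in pairs) if r is not None]
```

If an aligner loses high-mutation-content reads, the upper tail of this per-read distribution shrinks, which is what the comparison on simulated data is meant to detect.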

The other options besides just using STAR and accepting some loss of high-mutation content reads include:

  1. grandRescue, part of the gedi suite from the Erhard lab, uses STAR in conjunction with some custom tooling to better recover high-mutation-content reads. They also make a good point in their manuscript that nothing stops you from doing 3-base genome alignment with STAR (and thus approximating the benefits of HISAT-3N), apart from the challenge of having to manually impute where the T's were originally located in your genome.
  2. NextGenMap has a --slamseq scoring setting that specifically eliminates penalties for T-to-C mutations. This is nice, as you get the benefit of aligning to a higher-complexity 4-base genome while also not penalizing the mutations of interest. The downside is that NextGenMap is not splice aware, so if you are working with total RNA-seq data, it's best to align to a transcriptome and thus throw out reads from pre-mRNA.
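
The 3-base idea from option 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not grandRescue's or HISAT-3N's implementation: the helper names `to_three_base` and `hamming` and the sequences are made up, only the forward strand is handled (reads from the opposite strand would need the complementary A/G collapse), and a real workflow would convert a whole FASTA/FASTQ and then restore the original bases after alignment.

```python
def to_three_base(seq: str) -> str:
    """Collapse T into C so that T-to-C conversions no longer look like mismatches."""
    return seq.upper().replace("T", "C")


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


reference = "ATTGCTTACGT"  # made-up reference fragment
read = "ACTGCCTACGC"       # same fragment carrying three T-to-C conversions

# Against the ordinary 4-base reference, every conversion costs a mismatch.
mismatches_4base = hamming(reference, read)

# After collapsing T into C in both sequences, the conversions vanish.
mismatches_3base = hamming(to_three_base(reference), to_three_base(read))
```

The trade-off mentioned in option 2 is visible here as well: after the collapse, the reference has only three distinct letters, so it is less complex and reads are more likely to map ambiguously than against the 4-base genome.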

paulinarosales commented 3 months ago

Thank you very much for the elaborate response and helpful comments! I'll have a look at the recommended options :)