DaehwanKimLab / hisat-genotype

GNU General Public License v3.0

Separate the reference generation step from genotyping step #20

Closed tinyheero closed 3 years ago

tinyheero commented 3 years ago

Hi there,

I've noticed that when running HISAT-genotype it will first download and prepare a set of references if it doesn't already exist. I am interested in figuring out how to separate the reference generation from the actual running of the genotyping step. This would ease the integration of HISAT-genotype into a larger workflow.

I assume that the wrapper hisatgenotype calls a set of underlying scripts to prepare the references. Is it possible to call these underlying scripts independently of the wrapper to generate the required references? If so, what scripts should one be calling?

chbe-helix commented 3 years ago

Hi Fong,

The reference generation is currently integrated within the modules of HISATgenotype. Version 1.3.1 (coming out within a few weeks) removes this requirement. It adds an option to pre-build all the required indices at install time within the HISATgenotype install folder or, if you prefer, to build them in a data directory of your choice during the first run of HISATgenotype on the system. Will this new system meet your needs?

Thanks, Chris

tinyheero commented 3 years ago

That's great to hear that the newer version will decouple the reference generation from the HISATgenotype modules. Regarding the two options:

  1. Ideally, we would put HISAT-genotype in a Docker image for our pipeline. Putting all the pre-built indices into that image would give every Docker image we build with HISAT-genotype a large footprint, so I don't think that option will work for us.
  2. The option of specifying a data directory of choice during the first run of HISAT-genotype is more feasible for us. However, it seems odd that the reference generation has to happen during a run and can't be decoupled from it. I could imagine a scenario where multiple 'first' HISAT-genotype runs are launched at the same time, causing their reference generation steps to clash.

Is there an option that is a hybrid of the two, where you can pre-build the references before any runs but specify the data directory in which to store them? That way, every run would use that data directory.

If not, I guess one can conceptually think of the first run of HISAT-genotype being the reference generation step. One could use the example data provided in the tutorial to initiate this step. Then the reference data can be stored and reused on different clusters/machines. Is that thought process correct?
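To make the clash scenario concrete, here is a minimal sketch of how a workflow could guard a one-time reference build with a file lock, so that several simultaneous 'first' runs don't build into the same directory at once. Everything in it is an assumption for illustration: `ensure_references`, the `build` callback, and the marker and lock file names are hypothetical and not part of HISAT-genotype. `fcntl.flock` is POSIX-only and may not be reliable on some network filesystems.

```python
import fcntl
from pathlib import Path
from typing import Callable


def ensure_references(index_dir: str, build: Callable[[Path], None]) -> bool:
    """Build references into index_dir at most once.

    Returns True if this call ran the build, False if the references
    had already been built by an earlier call or another process.
    """
    root = Path(index_dir)
    root.mkdir(parents=True, exist_ok=True)
    done_marker = root / ".references_complete"  # hypothetical sentinel file
    lock_path = root / ".references.lock"        # hypothetical lock file

    with open(lock_path, "w") as lock_file:
        # Block here until no other process holds the exclusive lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            if done_marker.exists():
                # Someone else finished the build while we were waiting.
                return False
            # `build` is a placeholder for whatever prepares the references.
            build(root)
            # Only mark completion after the build returned without raising.
            done_marker.touch()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Each run would call `ensure_references` on the shared data directory before genotyping. Only the first caller performs the build, and the others wait on the lock and then skip it.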

chbe-helix commented 3 years ago

Hi Fong,

Sure thing! It can certainly be decoupled completely. I'll add some additional options or an independent wrapper/script to build/download the references before running hisatgenotype. Then it is a matter of specifying that directory during each run of hisatgenotype using the new syntax I have added to v1.3.1. I'll draft a script next week, test it, then integrate it into the new release.

Thanks, Chris

tinyheero commented 3 years ago

Amazing! Thanks Chris!

chbe-helix commented 3 years ago

Hi Fong,

The new version of HISATgenotype (1.3.1) has been released and has a new option to direct HISATgenotype to an index folder. You should now only have to download the index once, and you can do so at install time if you prefer. The manual will be updated with these changes soon. Let me know if you have any issues getting things added to a Docker image in the meantime.

Thanks, Chris