NationalGenomicsInfrastructure / piper

A genomics pipeline built on top of the GATK Queue framework

Some questions before adoption of piper #45

Open biocyberman opened 9 years ago

biocyberman commented 9 years ago

Hello Piper Developers, I have been aware of Piper for a while, but I did not pay enough attention to it. Now, given my familiarity with the Queue/GATK framework and pressure to increase analysis throughput, I have decided to do something similar to Piper. Actually, I intend to adopt it for my work. But there are some questions I would like to clarify first:

  1. What work needs to be done to have Piper run on my cluster? I cloned the repo, compiled and installed it, but I haven't done a test run. Looking through the code, I saw that UppmaxConfig appears many times. Does this mean Piper is heavily hard-coded for your system, Uppmax?
  2. Which Piper repo should I fork from? There are now 10 Piper forks, so it is hard for me to know which one I should fork from: NationalGenomicsInfrastructure/piper, johandahlberg/piper or Molmed/piper?
  3. I want to adapt Piper for color-space data produced by Lifetech's SOLiD sequencers. What are your thoughts about this? Would it be enough to write a new class in AlignmentUtils.scala to provide a new aligner for the whole workflow?
  4. A follow-up to the question above: in general, how do I add support for a new aligner, for example Novoalign, to use with Piper?

Thanks for the work with Piper!

Vang

johandahlberg commented 9 years ago

Hello Vang!

I'm happy to hear that you are considering adapting Piper.

  1. What work needs to be done to have Piper run on my cluster? I cloned the repo, compiled and installed it, but I haven't done a test run. Looking through the code, I saw that UppmaxConfig appears many times. Does this mean Piper is heavily hard-coded for your system, Uppmax?

It depends on what functionality you are interested in accessing, and what type of cluster you are running. Piper has been adapted for use on the Uppmax cluster (which is a Slurm cluster) using the drmaa jobrunner available in Queue. This might work out of the box on your cluster, or it might require changing the jobrunner (Queue has a few different ones). Exactly how to do this depends on the scenario you are trying to adapt it to. If you could expand a bit on what kind of cluster you want to run on, I might be able to offer additional advice.

If you choose to run in the "Shell" jobrunner mode you shouldn't need to make any modifications. Even though this does not let you distribute your computations as efficiently (as all analysis will be run on a single node), it might actually be preferable for very high-throughput analysis. This is at least what we've seen with our current hardware configuration, where we use this scheme to run human whole genome analysis.

  2. Which Piper repo should I fork from? There are now 10 Piper forks, so it is hard for me to know which one I should fork from: NationalGenomicsInfrastructure/piper, johandahlberg/piper or Molmed/piper?

You should fork from this repo: NationalGenomicsInfrastructure/piper as this is now the official main repo. (This has moved from Molmed/piper which might be part of what causes the confusion).

  3. I want to adapt Piper for color-space data produced by Lifetech's SOLiD sequencers. What are your thoughts about this? Would it be enough to write a new class in AlignmentUtils.scala to provide a new aligner for the whole workflow?

Exactly.

  4. A follow-up to the question above: in general, how do I add support for a new aligner, for example Novoalign, to use with Piper?

I've written a short tutorial about Piper/Queue here: https://github.com/johandahlberg/PiperWorkshop/blob/master/tutorial.md This is somewhat outdated by now, but I'll look into updating it very soon since I'm giving a workshop on Piper here: http://www.uppnex.se/events/eInfraMPS2015 so keep an eye out and there should be some updates on this front in January.
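To make the idea from questions 3 and 4 more concrete, here is a minimal, self-contained Scala sketch of a pluggable aligner. It is not taken from Piper's AlignmentUtils.scala: the names AlignerSketch, ReadInput, Aligner, BwaMem, Novoalign and select are all hypothetical, and the command lines are simplified assumptions that should be checked against the documentation of the aligner versions you use.

```scala
import java.io.File

// Hypothetical sketch: none of these names come from Piper itself.
object AlignerSketch {

  // Minimal description of one sample's input files (assumed for illustration).
  final case class ReadInput(sampleName: String, reads: Seq[File])

  // Each aligner only has to say how to build its own command line.
  trait Aligner {
    def name: String
    def commandLine(input: ReadInput, reference: File, output: File): String
  }

  object BwaMem extends Aligner {
    val name = "bwa-mem"
    def commandLine(input: ReadInput, reference: File, output: File): String = {
      val reads = input.reads.map(_.getPath).mkString(" ")
      s"bwa mem ${reference.getPath} $reads > ${output.getPath}"
    }
  }

  object Novoalign extends Aligner {
    val name = "novoalign"
    // Simplified invocation; verify the flags against your Novoalign version.
    def commandLine(input: ReadInput, reference: File, output: File): String = {
      val reads = input.reads.map(_.getPath).mkString(" ")
      s"novoalign -d ${reference.getPath} -f $reads -o SAM > ${output.getPath}"
    }
  }

  // Registry so that the rest of a workflow can pick an aligner by name.
  val aligners: Map[String, Aligner] =
    Seq[Aligner](BwaMem, Novoalign).map(a => a.name -> a).toMap

  def select(name: String): Option[Aligner] = aligners.get(name)

  def main(args: Array[String]): Unit = {
    val input = ReadInput(
      "sample1",
      Seq(new File("sample1_R1.fastq"), new File("sample1_R2.fastq")))
    // Print the command that would be run for the chosen aligner.
    select("novoalign").foreach { aligner =>
      println(aligner.commandLine(input, new File("ref.nix"), new File("sample1.sam")))
    }
  }
}
```

The point of the sketch is that supporting a new aligner means adding one implementation of the trait and registering it, while the code that consumes the alignment output stays unchanged. In a real Queue-based workflow the command would be wrapped in a job definition rather than printed.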

I hope I've been able to answer your questions. Let me know if you have any further questions. I'm currently on vacation and I might therefore not be very quick with answering but I'll get back to you as soon as possible.

Cheers, Johan

biocyberman commented 9 years ago

Hi Johan,

Thank you very much for taking the time to answer my questions. And Happy New Year :-) I was just checking things out and planning my activities for 2015, so there is no pressure to reply to my questions quickly.

The answers are very useful and I can use them as a guide to keep me on the right track. I missed the registration deadline for the Piper workshop. However, if by any chance I can still attend it, or any future workshop about Piper, please let me know.

I took some more time to look at Piper and found that it requires FASTQ sequence files as its starting point. Unfortunately, I do not work with FASTQ files very much. As you may know, SOLiD sequencers produce color-space data in the XSQ file format, which is a variant of the HDF5 format. Therefore I will have to change several Scala files (e.g. SetupFileCreator.scala, Sample.scala, etc.). I guess I need more experience with Piper to find my way around the code more quickly.

I will send some other questions, which are not related to the Piper code, to your email address.

Cheers and best wishes, Vang

johandahlberg commented 9 years ago

Hi!

I'd recommend that you write to the organizers of the workshop if you are interested. Perhaps they can squeeze in one more.

Since I've never worked with color-space data, it's unfortunately difficult for me to say exactly what changes would have to be made. As you've already noted, the input stage does expect FASTQ files; however, there is really nothing special going on there, so it should be fairly simple to change it to use some other file format.

You are very welcome to contact me, you can find my details here: http://katalog.uu.se/empinfo?id=N11-121_2&q=johan+dahlberg

/Johan