UCSF-Costello-Lab / LG3_Pipeline

The original LG3 pipeline
https://github.com/UCSF-Costello-Lab/LG3_Pipeline
0 stars 0 forks source link

TODO: integrate exomeQualityPlots scripts into LG3 pipeline #88

Open ivan108 opened 6 years ago

ivan108 commented 6 years ago

Stephanie created a pipeline to generate quality plots on top of the results of LG3 pipeline, see exomeQualityPlots.

According to current procedure, I ran this pipeline after LG3 is done. I think it makes sense to include exomeQualityPlots scripts into LG3 pipeline.

HenrikBengtsson commented 6 years ago

Awesome. What steps in the LG3 Pipeline needs to be completed before exomeQualityPlots can be run?

/cc @SRHilz

HenrikBengtsson commented 6 years ago

I see that you literally included code from exomeQualityPlots - I would strongly advise against this approach.

This will effectively create a fork of exomeQualityPlots, which cause maintanence hell. Any minor bug fix that is done in the exomeQualityPlots repos has to be copied over this repos. Likewise, if we find a bug and fix it here, @SRHilz will have to make sure our fixes will be included in the exomeQualityPlots repos too. Such an approach is error prone and introduces lots of extra work.

The two alternatives I see are:

  1. @SRHilz discontinues the exomeQualityPlots repos and does all future work of that functionality toward this repos. Such a move requires some serious merging of the two git repositories in order preserve the commit history etc. It's doable, but it's something that requires some serious thinking and planing. I would not recommend doing this right now.

  2. Instead, I would suggest that the LG3 Pipeline will "load" exomeQualityPlots, e.g. by helping @SRHilz provide installation instructions and have the LG3 Pipeline find it via, say, env variable EXOME_QUALITY_PLOTS_HOME (which could have some smart default location). On TIPCC, this can all be encapsulated via an env module, e.g. module load exomeQualityPlots that module load lg3 could load by default.

ivan108 commented 6 years ago

I support alternative #1. In fact I don't use the original exomeQualityPlots pipeline as is, but developed a more comfortable interface to it (launching scripts). It is like a branch from the original pipeline, but no version control. Now I adapted interface to our coding style and variable names, so it becomes just another steps of LG3. The pipeline addition has 3 new steps:

_run_QC_1
_run_QC_2
_run_QC_3

_run_QC_1 and _run_QC_2 can be run in parallel, and _run_QC_3 after 1 and 2 are done..

With the second alternatives we would have to first substantially refactor original exomeQualityPlots pipeline and then to write some loading modules...

HenrikBengtsson commented 6 years ago

Hmm... you closed the issue. If you want to go with Alt 1, then you need to have @SRHilz onboard.

In fact I don't use the original exomeQualityPlots pipeline as is, but developed a more comfortable interface to it (launching scripts).

Unless there are major changes, you should be able to write your own wrapper scripts that achieve what you've done but that just call the original exomeQualityPlots pipeline with the proper setup.

... It is like a branch from the original pipeline, but no version control.

Sorry, but nope. It's not. You've forked, or rather copied the latest version, her work and now there are two code bases that have to be maintained. If this would have been a small script, it might be acceptable, but there's lots of code in there. You don't want to do this. "Trust me". Maintenance of software is hard enough. This "forking" adds work for @SRHilz - if not today, it'll get to her at some point in the future.

ivan108 commented 6 years ago

I totally agree that we don't want to version control the same code in two places. The simple solution could be to clone exomeQualityPlots pipeline elsewhere and make symbolic links from our scripts/ directory. The only small complication is when we want to fix a bug, we would have to go through Stephanie. BTW, I got consent from Stephanie to use her scripts in LG3 pipeline first thing...

SRHilz commented 6 years ago

Hi all,

I am fine with whatever is easiest. Ideally it would be great to use a clone of my repository, just to keep things tidy. Can I set you both up as contributors on that repo as well, so that you can fix bugs on it as well?

Thanks, Stephanie

On Mon, Oct 15, 2018 at 10:30 PM IVAN SMIRNOV notifications@github.com wrote:

I totally agree that we don't want to version control the same code in two places. The simple solution could be to clone exomeQualityPlots pipeline elsewhere and make symbolic links from our scripts/ directory. The only small complication is when we want to fix a bug, we would have to go through Stephanie. BTW, I got consent from Stephanie to use her scripts in LG3 pipeline first thing...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/UCSF-Costello-Lab/LG3_Pipeline/issues/88#issuecomment-430105903, or mute the thread https://github.com/notifications/unsubscribe-auth/AFkZUI8fabWoGlpH-po0q-iQ5_1ZCM-Wks5ulW7YgaJpZM4XdR2T .

SRHilz commented 6 years ago

And Henrik, regarding the question of what needs to be run first -

exomeQualityPlots has three main steps:

  1. Create files for each bam file with information about exome coverage (only needs final recal bams)
  2. Using the R.mutation files to know where each called mutation is, go into recal bams and pull out more detailed QC info for each mutation (needs final recal bams + R.mutations)
  3. Generate a final QC file and make final plots for everything (needs output of 1 + 2 above)

Steps 1 and 2 can be executed at the same time if desired, or 1 can be started even earlier in the pipeline, right after recal is done but before mutation calling.

Hope that helps (sorry I've been slow to respond - I have lab meeting today). Let me know if you have any questions.

On Mon, Oct 15, 2018 at 4:16 PM Henrik Bengtsson notifications@github.com wrote:

Awesome. What steps in the LG3 Pipeline needs to be completed before exomeQualityPlots can be run?

/cc @SRHilz https://github.com/SRHilz

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/UCSF-Costello-Lab/LG3_Pipeline/issues/88#issuecomment-430046457, or mute the thread https://github.com/notifications/unsubscribe-auth/AFkZUC0ETFTN6nItK5RyWu3sXLvaEXEKks5ulRdZgaJpZM4XdR2T .

HenrikBengtsson commented 6 years ago

I am fine with whatever is easiest. Ideally it would be great to use a clone of my repository, just to keep things tidy. Can I set you both up as contributors on that repo as well, so that you can fix bugs on it as well?

Thanks. Yes, working toward a single code base is key (here and in all other projects). Giving us direct push writes to your repository is the most straightforward and probably entails fewer surprises/hiccups. The alternative is that we work toward a (git) fork on your repository and when we think we have something that's worth incorporating, we send you a pull request that you accept/merge.

HenrikBengtsson commented 6 years ago

@ivan108, could you please drop those new files again? make check fails because of them.

ivan108 commented 6 years ago

removed