Implement merge_gds and check_merge_gds

aofarrel commented 3 years ago

Oops! I got a little overeager and accidentally finished it early.

Checker workflow is included, tested, and passes, in spite of #14. ~~Maybe exclude PCA doesn't matter on the sample data?~~

aofarrel commented 3 years ago

Mentioned on Zoom:

Getting rid of overall workflow
Tidy up structure of the workflow

aofarrel commented 3 years ago

Wanted to clarify why I'm now against doing parameter_meta in this. Y'see, if we do it here, then that implies we should do it in all upcoming workflows, and uh...

https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/implement-null-model/documentation/null-model-wf.md#additional-null-model-settings

aofarrel commented 3 years ago

Up until now, the CWL has been grabbing chromosome numbers by starting from a string that has been split on "chr" and then checking if they are one or two digit by checking if the 1st (0 index) character is numeric.

However, in check_merged_gds the CWL changes tactics. It's instead splitting again, this time on "." and grabbing the 0th result of that split. See line 63: valueFrom: "${\n return inputs.gds_file.path.split('chr')[1].split('.')[0]\n}"

This seems to raise a bit of a dilemma, as our goals for this project may conflict. It's not a matter of difficulty - bringing the current code closer in line to the CWL is simply replacing

    g = open("chr_number", "a")
    # gds_second_part is already split on 'chr'
    if(unicode(str(gds_second_part[1])).isnumeric()):
        # two digit number
        g.write(gds_second_part[:2])
        gds_second_part = gds_second_part[2:]
    else:
        # one digit number or Y/X/M
        g.write(gds_second_part[0])
        gds_second_part = gds_second_part[1:]
    g.close()

with something like with open("chr_number") write gds_second_part[0].split('.') or what have you. But that then enforces a new restraint on filenames. We are already restraining users to having chr in the filename, but this new function requires it to be set up like.this.where.there.is.a.period.immediately.after.chrX.where.x.is.the.chromosome.number.gds

For all TOPMed data I have access to, that's fine, because Gen3 indexes chr-level VCFs to always have a period immediately after the chromosome number. (Yes, I checked all 500ish, what else is there to do on Friday night?) But I only have access to 1000 Genomes + Tutorial in AnVIL Gen3, so I can't check if this would break CCDG or other projects. CCDG has specifically been mentioned as something people want to compare data against when using this pipeline. I also don't know if TOPMed data taken in from outside Gen3 still follows those naming conventions. There could also be other datasets that follow the 'chr' naming convention but not the chr-then-immediate-period convention.

Ultimately, the potential issue at play here is that the AnVIL Project, unlike BioData Catalyst, is mostly WDL and Galaxy focused. That means that this WDL pipeline may have a broader scope than the CWL BDC-and-TOPMed focused version that it is mimicking, so having slightly less strict restraints on input data may be appropriate.

For the time being I propose that the chromosome function is left as it is for now, until we can get clarification from stakeholders as to what are and are not acceptable limitations to place. We can always make the change later.

It could be that chrX + period is the standard across basically everything and I'm writing all this text for nothing, of course, but I don't know that for sure...

aofarrel commented 3 years ago

Decisions made:

Leave the chr number as-is, ie, don't implement the . restriction
But... don't make this sort of thing a habit

aofarrel commented 3 years ago

Per this, the 'chr' check is now in the second task. It is not in the first task as the CWL is designed to work with or without chr in that task and that task only.

aofarrel commented 3 years ago

Please look over the changes to the checker workflow (which is now two workflows) and the documentation.

aofarrel commented 3 years ago

Cost estimates for the workflow + checker will be added once fully non-call cached runs complete on Terra. (Tests already completed successfully on Terra + local, they just had their earlier steps cached.)

aofarrel commented 3 years ago

Renamed checkers and their respective JSONs passed on Terra.

DataBiosphere / analysis_pipeline_WDL

Implement merge_gds and check_merge_gds #15