core-unit-bioinformatics / reference-container

Build repository for reference container
MIT License

add complex data transformations #9

Open ptrebert opened 1 year ago

ptrebert commented 1 year ago

For reference data transformations that are more complex (or if issue #6 is not solved in the near future), add functionality to dynamically load a separate Snakemake module (defined in the container config yaml) that contains the rules implementing these transformations.

svenwillger commented 1 year ago

A new branch issue_data_transformation has been created.

Functionality to dynamically load an external module through config definitions has been added. The data_transformation.smk module is activated only when the key use_data_transformation is set to True. In that case, the key data_transformation_workflow must provide the location of a Snakemake file.
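
As a rough illustration of the config check described above, here is a minimal Python sketch. The function name `resolve_transformation_module` is hypothetical and not taken from the branch; only the two config keys come from this comment. It assumes the config has already been parsed into a dict.

```python
from typing import Optional


def resolve_transformation_module(config: dict) -> Optional[str]:
    """Return the path of the external Snakemake file, or None if disabled.

    Hypothetical helper: only the two config keys are taken from the issue.
    """
    # The external module is only activated when the flag is explicitly True.
    if config.get("use_data_transformation", False) is not True:
        return None

    # When activated, a location for the Snakemake file must be provided.
    workflow = config.get("data_transformation_workflow")
    if not workflow:
        raise ValueError(
            "use_data_transformation is True, but "
            "data_transformation_workflow is not set"
        )
    return str(workflow)
```

In a Snakefile, the returned path could then be handed to an `include:` directive when it is not None; how the branch actually wires this up is not shown here.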

The commit message is: add functionality to import external module

svenwillger commented 1 year ago

The module convertfasta.smk has been added to the workflow in the branch issue_data_transformation. It is activated if the entry convert_fasta is set to True in a config.yaml. I also uploaded yaml files for the 2 mouse genomes mm10-GRCm38 and mm11-GRC39m that activate the new module.

The new module takes the fasta file (downloaded from an ftp server) and creates a table that connects the UCSC-style chromosome names with the GenBank AccessionIDs. A copy of the fasta file is then created with a "_original" addition, while in the {filename}.fasta file the headers containing the GenBank AccessionIDs are replaced with the UCSC-style chromosome names. A conversion table that contains the original header names and the new header names is also created.

Since snakemake can't handle 2 derive commands in one run and the fasta file has been modified, additional rules have been added that take the modified fasta file and create the {filename}.fasta.fai and {filename}.fasta.dict files. The MANIFEST and the to-be-built container then contain the original fasta file, the modified fasta file, conversion_table, fai file and dict file.
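
The header replacement step can be sketched roughly as follows. This is not the code from convertfasta.smk; the function name `rename_fasta_headers`, the assumption that the accession is the first whitespace-separated token of each header, and the accession `ACC0001.1` in the usage example are all made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def rename_fasta_headers(
    lines: Iterable[str], name_map: Dict[str, str]
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Replace FASTA headers using an accession -> UCSC-style name mapping.

    Returns the rewritten lines and a conversion table of
    (original header, new header) pairs. Hypothetical helper.
    """
    renamed: List[str] = []
    table: List[Tuple[str, str]] = []
    for line in lines:
        if line.startswith(">"):
            old_header = line[1:].rstrip("\n")
            # Assumption: the accession is the first token of the header.
            parts = old_header.split()
            seq_id = parts[0] if parts else ""
            # Headers without a mapping are kept unchanged.
            new_header = name_map.get(seq_id, old_header)
            table.append((old_header, new_header))
            renamed.append(">" + new_header + "\n")
        else:
            # Sequence lines pass through untouched.
            renamed.append(line)
    return renamed, table


if __name__ == "__main__":
    example = [">ACC0001.1 example chromosome 1\n", "ACGT\n"]
    new_lines, conversion = rename_fasta_headers(example, {"ACC0001.1": "chr1"})
    print("".join(new_lines), end="")
    print(conversion)
```

Because the headers change, any index built from the original file no longer matches, which is why the fai and dict files have to be created from the modified fasta file.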

FYI, I also used this template to generate a container for the T2T genome with the UCSC-style chromosome names. Some minor changes were needed for that, but those changes and the container are not part of this commit.