Closed RyulKim-Inocras closed 1 week ago
Hi RK,
Thanks for reaching out to us. You're right: the documentation link in the repo wrongly redirects to an outdated version. We will fix it.
Here you can find the right information on input and output: https://intogen-plus.readthedocs.io/en/v2024/usage.html#input-output
Please let us know if we can clarify any steps that are not explained in the docs. Thanks,
Fede
Thank you for your reply.
However, there still seems to be limited information on how to prepare the input files. Could you provide example files for beginners?
Regards,
RK.
Sure, the example is in the repo, specifically here: https://github.com/bbglab/intogen-plus/tree/master/test/pipeline/input/cbioportal_prad_broad
There you'll find a MAF file (the txt) and the instructions (the yaml file) that tell IntOGen how to parse it. As mentioned in the documentation:
[!IMPORTANT] All mutations should be mapped to the positive strand. The strand value is ignored.
In addition, each cohort must be associated with:
- a cohort ID (DATASET): a unique identifier for each cohort.
- a cancer type (CANCER): although any acronym can be used here, we recommend to restrict to the acronyms that can be found in extra/data/dictionary_long_name.json.
- a sequencing platform (PLATFORM): WXS for whole exome sequencing and WGS for whole genome sequencing.
- a reference genome (GENOMEREF): only HG38 and HG19 are supported.

Cohort file names, as well as the fields mentioned above, must not contain dots.
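As a rough illustration of the rules quoted above, here is a minimal Python sketch that checks one cohort's metadata before running the pipeline. It is not part of IntOGen or OpenVariant: the function name `check_cohort` and the dictionary layout are hypothetical. The text does not say whether the "no dots" rule covers the file extension, so this sketch assumes it applies only to the part of the file name before the final extension.

```python
from pathlib import Path

# Allowed values as stated in the documentation excerpt quoted above.
ALLOWED_PLATFORMS = {"WXS", "WGS"}
ALLOWED_GENOMES = {"HG38", "HG19"}
REQUIRED_FIELDS = ("DATASET", "CANCER", "PLATFORM", "GENOMEREF")


def check_cohort(file_name, metadata):
    """Return a list of human-readable problems (empty if none were found).

    `check_cohort` is a hypothetical helper, not an IntOGen function.
    """
    problems = []

    # Assumption: only the name before the final extension must be dot-free.
    if "." in Path(file_name).stem:
        problems.append(f"file name contains a dot: {file_name!r}")

    for field in REQUIRED_FIELDS:
        value = metadata.get(field)
        if not value:
            problems.append(f"missing {field}")
        elif "." in value:
            problems.append(f"{field} contains a dot: {value!r}")

    platform = metadata.get("PLATFORM")
    if platform and platform not in ALLOWED_PLATFORMS:
        problems.append(f"unsupported PLATFORM: {platform!r}")

    genome = metadata.get("GENOMEREF")
    if genome and genome not in ALLOWED_GENOMES:
        problems.append(f"unsupported GENOMEREF: {genome!r}")

    return problems
```

Calling `check_cohort("my_cohort.txt", {"DATASET": "MY_COHORT", "CANCER": "PRAD", "PLATFORM": "WXS", "GENOMEREF": "HG19"})` returns an empty list; the cohort name and values here are made up for the example.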
The way to provide those values is through OpenVariant, a comprehensive Python package that provides functionality to read, parse and operate on multiple input file formats (e.g. tsv, csv, vcf, maf, bed). Whether you are planning to run a single cohort or multiple cohorts, you need to provide an annotation file in yaml format that specifies the above-mentioned structure required by IntOGen. Instructions on how to build an annotation file are documented here: OpenVariant annotation file.
We use OpenVariant to parse the input, so the input files should stick to the structure that OpenVariant requires. The OpenVariant documentation and repository contain extensive information and more examples of input data.
Hope this clears up your doubts.
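Before writing the annotation file, it can help to see which column names your own input file actually has, since those are the names the annotation has to refer to. The snippet below is a small sketch for that purpose using only the Python standard library. It does not use OpenVariant, and the helper name `read_column_names` is hypothetical. It assumes a delimited text file whose header is the first line that is neither empty nor a `#` comment (MAF files often start with a `#version` line).

```python
import csv


def read_column_names(path, delimiter="\t", comment_prefix="#"):
    """Return the column names from the first non-empty, non-comment line.

    Hypothetical helper for inspecting an input file; not part of OpenVariant.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        for line in handle:
            # Skip blank lines and comment lines such as "#version 2.4".
            if not line.strip() or line.startswith(comment_prefix):
                continue
            # Parse the header with csv so quoted fields are handled.
            return next(csv.reader([line.rstrip("\r\n")], delimiter=delimiter))
    return []
```

For a tab-separated MAF file this returns a list of header names, which you can then compare against the fields referenced in the example yaml file in the repo.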
Best,
Federica
Hi @RyulKim-Inocras, since no update was provided, I will close this issue as completed. Please feel free to comment further if needed.
Problem solved! Thanks!
Hi,
I would like to know how to prepare the input files to run IntOGen-Plus. I couldn't find any documentation explaining the detailed input format for the pipeline. Are there any example files available? If so, they would be very helpful.
Best regards,
RK