bbglab / intogen-plus

a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients.
https://www.intogen.org/search

How to prepare inputs? #41

Closed RyulKim-Inocras closed 1 week ago

RyulKim-Inocras commented 3 weeks ago

Hi,

I would like to know how to prepare the input files to run IntOGen-Plus. I couldn't find any documentation explaining the detailed input format for the pipeline. Are there any example files available? If so, they would be very helpful.

Best regards,

RK

FedericaBrando commented 3 weeks ago

Hi RK,

Thanks for reaching out to us. You're right: the documentation link in the repo wrongly redirects to an outdated version. We will fix it.

Here you can find the right information on input and output: https://intogen-plus.readthedocs.io/en/v2024/usage.html#input-output

Please let us know if any steps are not clearly explained in the docs. Thanks

Fede

RyulKim-Inocras commented 3 weeks ago

Thank you for your reply.

However, there still seems to be limited information on how to prepare input files. Could you provide example files for beginners?

Regards,

RK.

FedericaBrando commented 3 weeks ago

Sure, the example is in the repo, here specifically: https://github.com/bbglab/intogen-plus/tree/master/test/pipeline/input/cbioportal_prad_broad

You'll find a MAF file (the txt) and the instructions (the yaml file) that tell IntOGen how to parse it. As mentioned in the documentation:

[!IMPORTANT] All mutations should be mapped to the positive strand. The strand value is ignored.

In addition, each cohort must be associated with:

a cohort ID (DATASET): a unique identifier for each cohort.

a cancer type (CANCER): although any acronym can be used here, we recommend restricting it to the acronyms found in extra/data/dictionary_long_name.json.

a sequencing platform (PLATFORM): WXS for whole exome sequencing and WGS for whole genome sequencing.

a reference genome (GENOMEREF): only HG38 and HG19 are supported.

Cohort file names, as well as the fields mentioned above, must not contain dots.
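As a rough illustration of the constraints listed above, here is a minimal Python sketch that checks a cohort's metadata before running the pipeline. It is not part of IntOGen or OpenVariant: the function name `check_cohort` and the dictionary layout are hypothetical, the check treats the allowed values as case-sensitive exactly as written above, and it assumes the "no dots" rule applies to the file name without its final extension.

```python
from pathlib import PurePath

# Allowed values as stated in the documentation excerpt above.
ALLOWED_PLATFORMS = {"WXS", "WGS"}
ALLOWED_GENOMES = {"HG38", "HG19"}
REQUIRED_FIELDS = ("DATASET", "CANCER", "PLATFORM", "GENOMEREF")


def check_cohort(file_name, fields):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper: `fields` maps the field names above to their values.
    """
    problems = []

    # Assumption: the extension itself is allowed, so only the stem is checked.
    stem = PurePath(file_name).stem
    if "." in stem:
        problems.append(f"file name {file_name!r} contains a dot besides the extension")

    # Every required field must be present and must not contain dots.
    for key in REQUIRED_FIELDS:
        value = fields.get(key)
        if not value:
            problems.append(f"{key} is missing")
        elif "." in value:
            problems.append(f"{key} value {value!r} contains a dot")

    # Platform and reference genome are restricted to a fixed set of values.
    platform = fields.get("PLATFORM")
    if platform and platform not in ALLOWED_PLATFORMS:
        problems.append(f"PLATFORM must be one of {sorted(ALLOWED_PLATFORMS)}")

    genome = fields.get("GENOMEREF")
    if genome and genome not in ALLOWED_GENOMES:
        problems.append(f"GENOMEREF must be one of {sorted(ALLOWED_GENOMES)}")

    return problems


if __name__ == "__main__":
    # Illustrative values only; replace them with your own cohort's metadata.
    example = {
        "DATASET": "MY_COHORT",
        "CANCER": "PRAD",
        "PLATFORM": "WXS",
        "GENOMEREF": "HG38",
    }
    for problem in check_cohort("my_cohort.txt", example):
        print(problem)
```

In a real run these values are supplied through the annotation file rather than a Python dictionary, so this sketch is only meant as a quick sanity check of the rules themselves.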

The way to provide those values is through OpenVariant, a comprehensive Python package that provides functionality to read, parse and operate on multiple input file formats (e.g. tsv, csv, vcf, maf, bed). Whether you are planning to run single or multiple cohorts, you need to provide an annotation file in yaml format to specify the above-mentioned structure required by IntOGen. Instructions on how to build an annotation file are documented here: OpenVariant annotation file.

We use OpenVariant to parse the input, so the input files should follow the structure that OpenVariant requires. The OpenVariant documentation and repository contain extensive information and more examples of input data.

Hope this clears up your doubts.

Bests,

Federica

FedericaBrando commented 1 week ago

Hi @RyulKim-Inocras, since no update was provided, I will close this issue as completed. Please feel free to comment further if needed.

RyulKim-Inocras commented 5 days ago

Problem solved! Thanks!