bbglab / intogen-plus

a framework for automatic and comprehensive knowledge extraction based on mutational data from sequenced tumor samples from patients.
https://www.intogen.org/search

IntOGen-plus | unfiltered drivers shows NaNs #19

Closed FedericaBrando closed 5 months ago

FedericaBrando commented 5 months ago

In unfiltered_drivers.tsv, some genes end up with NaNs in the following columns:

TRANSCRIPT  COHORT  CANCER_TYPE MUTATIONS   SAMPLES_COHORT

This is probably caused by the merge between cohort.tsv and the {COHORT}.drivers.tsv files.
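One way to check the merge hypothesis is to redo the join with pandas' `indicator=True` and look at the rows that found no partner. This is only a sketch on made-up data: the `SYMBOL` column, the cohort names, and the values are hypothetical stand-ins, not the real contents of cohort.tsv or the drivers files.

```python
import pandas as pd

# Hypothetical stand-in for cohort.tsv (values are invented).
cohorts = pd.DataFrame({
    "COHORT": ["COHORT_A", "COHORT_B"],
    "CANCER_TYPE": ["BRCA", "LUAD"],
    "SAMPLES_COHORT": [120, 85],
})

# Hypothetical stand-in for the concatenated {COHORT}.drivers.tsv files.
# The last row has lost its cohort label.
drivers = pd.DataFrame({
    "SYMBOL": ["TP53", "KRAS", "EGFR"],
    "COHORT": ["COHORT_A", "COHORT_B", None],
})

# A left join keeps every driver row; indicator=True adds a "_merge" column
# that records whether a matching cohort row was found.
merged = drivers.merge(cohorts, on="COHORT", how="left", indicator=True)

# Rows marked "left_only" are the ones whose cohort columns come out as NaN.
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["SYMBOL", "COHORT", "CANCER_TYPE", "SAMPLES_COHORT"]])
```

If `unmatched` is empty on the real tables, the NaNs are introduced somewhere other than this merge.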

Further investigation is needed.

FedericaBrando commented 5 months ago

This is probably due to some genes having an empty gene name. It could be solved by using another identifier for those specific rows.

FedericaBrando commented 5 months ago

Upon inspection of the code in DriverSummary, the above-mentioned problem is due to these lines, where the name of the cohort used to annotate the vet dataframe is taken from the drivers dataframe.

A bigger problem, however, is that the zip does not guarantee that the drivers dataframe and the vet dataframe come from the same cohort. This leads either to a NaN being appended for cohorts that have a vet dataframe but no drivers dataframe, or to a cohort's vet dataframe being annotated with the name of a different cohort.

Relying on the file name to get the cohort would solve the issue.
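A minimal sketch of the difference between positional pairing with `zip` and pairing by file name. It is not the DriverSummary code: the helper names, the `.vet.tsv` suffix, and the cohort names are assumptions made for illustration, and it assumes cohort names contain no dots.

```python
from pathlib import Path


def cohort_name(path):
    """Cohort label from the file name, e.g. 'COHORT_A.drivers.tsv' -> 'COHORT_A'."""
    return Path(path).name.split(".")[0]


def pair_by_cohort(drivers_files, vet_files):
    """Return (cohort, drivers_file_or_None, vet_file) for every vet file."""
    drivers = {cohort_name(p): p for p in drivers_files}
    # The cohort label always comes from the vet file's own name, so a missing
    # drivers file shows up as None instead of shifting the labels.
    return [(cohort_name(v), drivers.get(cohort_name(v)), v) for v in vet_files]


# Hypothetical inputs: COHORT_B has a vet file but no drivers file.
drivers_files = ["COHORT_A.drivers.tsv", "COHORT_C.drivers.tsv"]
vet_files = ["COHORT_A.vet.tsv", "COHORT_B.vet.tsv", "COHORT_C.vet.tsv"]

# Positional pairing: COHORT_B's vet file gets labelled COHORT_C,
# and COHORT_C's vet file is silently dropped.
zipped = [(cohort_name(d), v) for d, v in zip(drivers_files, vet_files)]

# Pairing by file name: every vet file keeps its own cohort label.
pairs = pair_by_cohort(drivers_files, vet_files)

print(zipped)
print(pairs)
```

With this pairing, a cohort without a drivers file is explicit (`None`) and can be skipped or handled separately, rather than producing a NaN or a wrong cohort annotation.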