cognoma / cancer-data

TCGA data acquisition and processing for Project Cognoma

Reshape mutation matrix for use by core-service repository #34

Closed stephenshank closed 4 years ago

stephenshank commented 7 years ago

The current wide format of the mutation matrix complicates things in the core-service repository. A format better suited to populating the core-service mutation model would have one row per mutated sample-gene pair:

sample_id   entrez_gene_id
TCGA-18-3406-01 1
TCGA-38-4631-01 1
...

stephenshank commented 7 years ago

See #35.

dhimmel commented 7 years ago

I agree this is an important step, but I think we will want to do it slightly differently than in #35. I think we should do the processing in 2.TCGA-process.ipynb where the mutation data starts out in a melted format. I also think we may want to add some additional columns like mutation severity which will be useful for the frontend in the future.

Until we sort these things out, can you use the workaround here for https://github.com/cognoma/core-service/pull/42 (which is a super high priority PR, so let's complete that ASAP):

import bz2
import csv

path = 'mutation-matrix.tsv.bz2'
# Open the bzip2-compressed TSV in text mode; the with block closes the file.
with bz2.open(path, 'rt') as read_file:
    reader = csv.DictReader(read_file, delimiter='\t')
    for row in reader:
        # Remove sample_id so that only the gene columns remain in the row.
        sample_id = row.pop('sample_id')
        for entrez_gene_id, mutation_status in row.items():
            if mutation_status == '1':
                # Create mutation from entrez_gene_id, sample_id
                pass

stephenshank commented 7 years ago

bz2 module for the win! So simple this way!!
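
For reference, here is a minimal sketch of how the same loop could write out the two-column format requested at the top of this issue. It is not the approach from #35 or from 2.TCGA-process.ipynb. The function names `iter_mutations` and `write_long_format` are hypothetical, and the small in-memory matrix is made-up illustration data that only reuses the sample IDs quoted above.

```python
import csv
import io


def iter_mutations(wide_file):
    """Yield (sample_id, entrez_gene_id) for every cell equal to '1'."""
    reader = csv.DictReader(wide_file, delimiter='\t')
    for row in reader:
        # Remove sample_id so that only the gene columns remain in the row.
        sample_id = row.pop('sample_id')
        for entrez_gene_id, mutation_status in row.items():
            if mutation_status == '1':
                yield sample_id, entrez_gene_id


def write_long_format(wide_file, long_file):
    """Write one tab-separated row per mutated sample-gene pair."""
    writer = csv.writer(long_file, delimiter='\t', lineterminator='\n')
    writer.writerow(['sample_id', 'entrez_gene_id'])
    writer.writerows(iter_mutations(wide_file))


# Made-up wide matrix: one row per sample, one column per Entrez gene ID.
wide = io.StringIO(
    'sample_id\t1\t2\n'
    'TCGA-18-3406-01\t1\t0\n'
    'TCGA-38-4631-01\t1\t1\n'
)
long_out = io.StringIO()
write_long_format(wide, long_out)
print(long_out.getvalue())
```

Both functions accept any text-mode file object, so the real matrix could presumably be passed in as `bz2.open(path, 'rt')`, with the output going to a handle opened via `bz2.open(..., 'wt')`.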