Parse data into .ssm file

ethanumn commented 3 years ago

TODO

[x] Automate aggregation of xlsx files per patient
[x] Automate production of .ssm and .params.json file from aggregated xlsx file
[x] Add method to create garbage mutations entry in .params.json file (empty for now, but will build upon later)
[x] Make name in .ssm file gene_position
[x] Combine rows in .ssm for each chromosome_position into one row, where var_reads, total_reads, and read_prob are comma-separated lists holding each sample's values for that chromosome_position, in the same sample order (should be 148 lines, each with 29 comma-separated values in these fields)
[x] Plot VAF distribution for each chromosome_pair for each sample (29 histograms)
[x] Change xlsx aggregation so that if a unique chromosome_pair does not exist for a sample in the master, and its total read count is pulled from the all calls spreadsheet, its variant read count is set to 0
[x] When performing xlsx aggregation, if a unique chromosome_pair does not exist for a sample in the master and is pulled from the all calls spreadsheet, plot its variant read count on a histogram
[x] When performing xlsx aggregation, if a unique chromosome_pair does not exist for a sample in the master and does not exist in the all calls spreadsheet, plot its imputed total read count on a histogram
[x] Write tests for producing .ssm and .params.json files
[x] Write script to generate PDF with metrics about aggregation
[x] Write tests for aggregating xlsx files
[ ] Write script to automate generating inputs for all patient data
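
A minimal sketch of the row-combining step from the list above, assuming one input record per (variant, sample) pair. The record keys (`gene`, `chrom`, `pos`, `sample`), the function name, and the placeholder probability of 0.5 are hypothetical, and the header reflects my understanding of the Pairtree .ssm columns, so check it against the Pairtree documentation. It is not the repository's actual implementation.

```python
def build_ssm_lines(records, samples, default_prob=0.5):
    """Collapse per-sample records into one tab-separated .ssm line per variant.

    records: iterable of dicts with the hypothetical keys gene, chrom, pos,
             sample, var_reads, total_reads.
    samples: sample names in the order the comma-separated lists should use.
    Assumes every variant already has a record for every sample.
    """
    grouped = {}  # (chrom, pos) -> entry; dicts keep insertion order
    for rec in records:
        key = (rec["chrom"], rec["pos"])
        entry = grouped.setdefault(
            key, {"name": f'{rec["gene"]}_{rec["pos"]}', "by_sample": {}}
        )
        entry["by_sample"][rec["sample"]] = (rec["var_reads"], rec["total_reads"])

    # Header as I understand the Pairtree format (assumption).
    lines = ["\t".join(["id", "name", "var_reads", "total_reads", "var_read_prob"])]
    for index, entry in enumerate(grouped.values()):
        var_reads = ",".join(str(entry["by_sample"][s][0]) for s in samples)
        total_reads = ",".join(str(entry["by_sample"][s][1]) for s in samples)
        probs = ",".join(str(default_prob) for _ in samples)
        lines.append("\t".join([f"s{index}", entry["name"], var_reads, total_reads, probs]))
    return lines


# Made-up illustrative records: one variant observed in two samples.
example_records = [
    {"gene": "GENE1", "chrom": "9", "pos": 12345, "sample": "A", "var_reads": 10, "total_reads": 100},
    {"gene": "GENE1", "chrom": "9", "pos": 12345, "sample": "B", "var_reads": 0, "total_reads": 80},
]
example_lines = build_ssm_lines(example_records, samples=["A", "B"])
```

With 148 variants and 29 samples, this layout would give 148 data lines plus the header, each with 29 values per list field.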

ethanumn commented 3 years ago

Per my current understanding, I have developed a set of scripts to automate the aggregation of xlsx files and the production of .ssm and .params.json files.

ethanumn commented 3 years ago

Was able to run pairtree successfully using the inputs generated for the MATS08 test data. Unsure whether the results make sense, but pairtree was able to utilize my inputs.

ethanumn commented 3 years ago

Wrote a set of basic test cases (encapsulated in a unittest) for .ssm files
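
As an illustration of what such checks could look like (a sketch only, not the tests in the repository), the helper below verifies that every data line has five tab-separated columns, that each list field has one value per sample, and that no variant read count exceeds its total. The function name, the class name, and the sample values are hypothetical.

```python
import unittest


def ssm_problems(lines, n_samples):
    """Return a list of problems found in .ssm lines (the first line is the header)."""
    problems = []
    for lineno, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 5:
            problems.append(f"line {lineno}: expected 5 columns, got {len(fields)}")
            continue
        var_reads, total_reads, probs = (field.split(",") for field in fields[2:])
        if not (len(var_reads) == len(total_reads) == len(probs) == n_samples):
            problems.append(f"line {lineno}: expected {n_samples} values per list")
            continue
        if any(int(v) > int(t) for v, t in zip(var_reads, total_reads)):
            problems.append(f"line {lineno}: var_reads exceeds total_reads")
    return problems


class SsmFormatTest(unittest.TestCase):
    HEADER = "id\tname\tvar_reads\ttotal_reads\tvar_read_prob"

    def test_well_formed_file_has_no_problems(self):
        lines = [self.HEADER, "s0\tGENE1_12345\t10,0\t100,80\t0.5,0.5"]
        self.assertEqual(ssm_problems(lines, n_samples=2), [])

    def test_short_list_is_reported(self):
        lines = [self.HEADER, "s0\tGENE1_12345\t10\t100,80\t0.5,0.5"]
        self.assertEqual(len(ssm_problems(lines, n_samples=2)), 1)
```

The test case can be run with `python -m unittest` from the directory containing the module.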

ethanumn commented 3 years ago

Wrote a set of basic printouts when aggregating .xlsx files to make sure the number of aggregated variants, zero reads, etc. make sense

ethanumn commented 3 years ago

Made changes per request. Verified all changes using data in "example/" (comparing calls, master, aggregated, .ssm, params.json, etc.). Passed ssm tests.

Created a class to generate a PDF of metrics. Not the prettiest PDF, but it will suffice (used matplotlib and PdfPages).

example.metrics.pdf

ethanumn commented 3 years ago

Added workaround to sort rows in dataframe/xls by chromosome number. Double checked all of the pandas merge calls. Added some more statements to be printed to output pdf.
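
One way such a sort could work is a key function that orders numeric chromosome labels numerically and places non-numeric ones (X, Y, and so on) after them. This is a sketch under assumptions: the function name and the handling of a `chr` prefix are hypothetical and may differ from the workaround in the repository.

```python
def chrom_sort_key(chrom):
    """Sort key: numeric chromosomes in numeric order, then other labels alphabetically."""
    label = str(chrom).strip()
    if label.lower().startswith("chr"):  # tolerate labels such as "chr3" (assumption)
        label = label[3:]
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label.upper())


chromosomes = ["10", "2", "X", "1", "Y", "chr3"]
ordered = sorted(chromosomes, key=chrom_sort_key)
```

A plain string sort would put "10" before "2", which is the ordering problem this key avoids.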

ethanumn commented 3 years ago

https://github.com/ethanumn/mpn-aml-pairtree/blob/b604a8fd846a963b4766e937fa608f3955df7d5d/utils/xls_file/xls_aggregators/mpn_aml_aggregator.py#L183

The issue here is that calls_df overwrites VAF in aggregated_df, so the stale VAF shows up down the line even though ALT_DEPTH has been set to zero. The solution is to drop VAF from calls_df.
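
A minimal pandas sketch of the fix, assuming the data from the calls sheet is joined in with a merge. The key columns `CHROM` and `POS`, the `DEPTH` column, the function name, and the recomputation of VAF from the depths are all hypothetical; only `calls_df`, `ALT_DEPTH`, and `VAF` come from the code linked above.

```python
import pandas as pd

KEYS = ["CHROM", "POS"]  # hypothetical key columns


def pull_depth_from_calls(missing_df, calls_df):
    """Fill total depth from the calls sheet for variants missing in the master."""
    # Drop VAF first so the value from the calls sheet cannot leak into the result.
    calls_no_vaf = calls_df.drop(columns=["VAF"], errors="ignore")
    merged = missing_df.merge(calls_no_vaf, on=KEYS, how="left")
    merged["ALT_DEPTH"] = 0  # variant absent from the master: zero variant reads
    merged["VAF"] = merged["ALT_DEPTH"] / merged["DEPTH"]  # consistent with the depths
    return merged


# Made-up illustrative data.
missing = pd.DataFrame({"CHROM": ["1"], "POS": [100]})
calls = pd.DataFrame(
    {"CHROM": ["1", "2"], "POS": [100, 200], "DEPTH": [50, 80], "VAF": [0.1, 0.1]}
)
result = pull_depth_from_calls(missing, calls)
```

Dropping the column before the join means any VAF in the result has to be derived from ALT_DEPTH, so the two cannot disagree.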

Parse data into .ssm file #6