aleighbrown / pwgs_snakemake

Snakemake pipeline for running PhyloWGS on NIH Biowulf Cluster

Error with facets as input #2

Open adhisadi opened 8 months ago

adhisadi commented 8 months ago

I get this error when running the parser with a facets.tsv file. My TSV file has these columns:

image

Traceback (most recent call last):
  File "/Users/adhisadi/workspace/analysis/parse_cnvs_facets_extension.py", line 232, in <module>
    main()
  File "/Users/adhisadi/workspace/analysis/parse_cnvs_facets_extension.py", line 228, in main
    regions = parser.parse()
              ^^^^^^^^^^^^^^
  File "/Users/adhisadi/workspace/analysis/parse_cnvs_facets_extension.py", line 100, in parse
    if (str.isdigit(record['tcn.em']) and str.isdigit(record['lcn.em'])):
KeyError: 'tcn.em'

adhisadi commented 8 months ago

I also want to know how the parser deals with NAs in the lcn.em column. I have many NAs, and I am doing a multi-sample analysis. Let's assume I remove the lines with NAs from this file. The final CNV input for PhyloWGS has the information for each sample separated by semicolons. So I am wondering: if one sample has NA for a segment and another does not, do I lose the value for that segment in all samples?
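The traceback above shows the parser testing `str.isdigit(record['tcn.em'])` and `str.isdigit(record['lcn.em'])` on each record. As a minimal sketch of what that condition implies for NA values: the sample rows and the `chrom` column below are made up for illustration, and what the parser actually does with rows that fail the check is not visible in the traceback, so treating them as "skipped" is an assumption.

```python
# Hypothetical FACETS-style records; only the column names 'tcn.em' and
# 'lcn.em' come from the traceback, everything else is illustrative.
records = [
    {"chrom": "1", "tcn.em": "2", "lcn.em": "1"},
    {"chrom": "1", "tcn.em": "3", "lcn.em": "NA"},
    {"chrom": "2", "tcn.em": "4", "lcn.em": "2"},
]


def passes_digit_check(record):
    """Same condition as the line quoted in the traceback."""
    return str.isdigit(record["tcn.em"]) and str.isdigit(record["lcn.em"])


# Rows where both copy-number fields are plain integers.
usable = [r for r in records if passes_digit_check(r)]
# Rows where at least one field is not a plain integer (for example "NA").
skipped = [r for r in records if not passes_digit_check(r)]

print(len(usable), "rows pass the check;", len(skipped), "rows do not")
```

Because `"NA".isdigit()` is `False`, a row with NA in lcn.em never satisfies this condition. The check is applied record by record; how a segment that fails in one sample is reconciled with the other samples cannot be determined from the traceback alone.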

aleighbrown commented 8 months ago

Whooo - it's been a while since I worked on this....

I mean - the KeyError suggests to me that the file isn't formatted the way the parser expects. I believe I assumed CSV input in the facets parser, so switching to CSV might just fix it:

reader = csv.DictReader(facetf)
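A small sketch of why a tab-separated file would produce this KeyError with a default `csv.DictReader`: the sample data and the `chrom`, `start`, and `end` columns are made up. Passing `delimiter="\t"` is shown as one possible alternative to converting the file to CSV; it is not necessarily how the parser in this repository is meant to be used.

```python
import csv
import io

# Hypothetical tab-separated FACETS-style content.
tsv_text = "chrom\tstart\tend\ttcn.em\tlcn.em\n1\t100\t200\t2\t1\n"

# Default delimiter is a comma, so the whole header line becomes a single key
# and record['tcn.em'] would raise KeyError.
comma_row = next(csv.DictReader(io.StringIO(tsv_text)))
print(list(comma_row.keys()))
print("tcn.em" in comma_row)

# With an explicit tab delimiter the columns are split as expected.
tab_row = next(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
print(tab_row["tcn.em"], tab_row["lcn.em"])
```

Either approach (re-saving the file as comma-separated, or reading it with a tab delimiter) gives the reader a header in which `tcn.em` is its own column.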

But, are you sure you want to use PhyloWGS for tumor phylogeny? I don't work in cancer anymore, but I suspect the field has developed something better since this...

There's a major problem (in my opinion): they model CNVs as single SNPs, and then make the assumption (true for SNPs but arguably very much false for CNVs) that a mutation can only happen once in the course of tumor evolution. So if, say, you have a duplication at a gene, and then as the tumor progresses that gene is amplified again, the software will break in unexpected ways.

adhisadi commented 8 months ago

Thanks for the quick reply, that worked. I agree with you; I was also wondering whether these assumptions still hold. But yes, PhyloWGS and especially PyClone are still being used for whole-exome and whole-genome sequencing data.

adhisadi commented 8 months ago

This might be a separate issue for the PhyloWGS GitHub, but maybe you could advise me from your experience. I want to create the cnv_data.txt file required for PhyloWGS. I only have txt files for my SNVs and indels, with all the columns PhyloWGS requires (id, gene, number of reference-allele reads, total number of reads at the variant position). I do not have a VCF. I was trying to adapt the create_phylowgs_inputs.py script to start from the txt file, because I believe the script derives these columns from the VCF at some point. My idea was to find that function within the script and run only the functions that come after it. I am still learning Python, and I could not figure out which function it is; the script is long and has too many functions and variables for a beginner like me to go through one by one. If I just write a script that finds the SNVs in each CNV region and merges the samples (for multi-sample analysis) with semicolons to create cnv_data.txt, would it do the same thing? I might need to remove the CNVs that have NAs in the lcn column, though.
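The "SNVs in CNV region" step described above can be sketched as a simple interval lookup. Everything here is an assumption made for illustration: the `chrom`, `pos`, `start`, `end`, `tcn`, and `lcn` field names, the inclusive coordinates, and the sample values. It does not reproduce create_phylowgs_inputs.py, and the exact cnv_data.txt layout should be taken from the PhyloWGS documentation rather than from this sketch.

```python
# Hypothetical SNV records: the id plus a genomic position (the position
# columns are assumed; they are not among the columns listed in the post).
snvs = [
    {"id": "s0", "chrom": "1", "pos": 150},
    {"id": "s1", "chrom": "1", "pos": 950},
    {"id": "s2", "chrom": "2", "pos": 40},
]

# Hypothetical CNV segments for one sample, with total and minor copy number.
segments = [
    {"chrom": "1", "start": 100, "end": 500, "tcn": 3, "lcn": 1},
    {"chrom": "2", "start": 1, "end": 100, "tcn": 2, "lcn": 0},
]


def snvs_in_segment(segment, snv_records):
    """Return ids of SNVs whose position lies inside the segment (inclusive)."""
    return [
        snv["id"]
        for snv in snv_records
        if snv["chrom"] == segment["chrom"]
        and segment["start"] <= snv["pos"] <= segment["end"]
    ]


# Map each segment, keyed by its coordinates, to the SNV ids it contains.
overlaps = {
    (seg["chrom"], seg["start"], seg["end"]): snvs_in_segment(seg, snvs)
    for seg in segments
}

for region, ids in overlaps.items():
    print(region, ids)
```

For a multi-sample analysis the same lookup would be run per sample, and the per-sample results then combined in whatever form the PhyloWGS input format requires.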