Open adhisadi opened 8 months ago
I also want to know how it deals with NAs in the lcn.em column. I have many NAs, and I am doing a multisample analysis. Let's assume I remove the lines with NAs from this file. The final cnv input for phylowgs has the information for each sample separated by semicolons. So I am wondering: if one sample has NA and another does not, do I lose the value for this section for all samples?
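For reference, the condition quoted in the traceback in this thread (`str.isdigit(record['tcn.em']) and str.isdigit(record['lcn.em'])`) can be tried on its own. This is a minimal sketch, assuming that records failing the check are simply skipped; the `usable` helper, the `sample` key, and the values are made up for illustration. It only shows what happens within one file. Whether a segment that is NA in one sample is then dropped for all samples depends on the merging code, which is not shown in this thread.

```python
def usable(record):
    """Same test as the parser line in the traceback: both values must be digit strings."""
    return str.isdigit(record["tcn.em"]) and str.isdigit(record["lcn.em"])


# Hypothetical records; "sample" is an illustrative key, not a FACETS column.
records = [
    {"sample": "S1", "tcn.em": "2", "lcn.em": "1"},
    {"sample": "S2", "tcn.em": "3", "lcn.em": "NA"},
]

# "NA" is not a digit string, so the S2 record fails the check.
kept = [r["sample"] for r in records if usable(r)]
print(kept)
```

So, under that assumption, a row with NA in lcn.em would not pass the check even if it is left in the file.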
Whooo - it's been a while since I worked on this....
I mean - the key error suggests to me the file isn't formatted correctly, I believe I assumed a CSV input in the facets parser - so a switch to that might just fix it
reader = csv.DictReader(facetf)
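To illustrate why the delimiter matters, here is a minimal sketch, assuming the FACETS output is tab-separated and has `tcn.em` and `lcn.em` columns (the other column names and all values below are made up). With the default comma delimiter, `csv.DictReader` reads the whole header as one column name, so `record['tcn.em']` raises a `KeyError`. Passing `delimiter="\t"` is an alternative to converting the file to CSV.

```python
import csv
import io

# Made-up FACETS-like table, tab-separated (illustrative values only).
tsv_text = (
    "chrom\tstart\tend\ttcn.em\tlcn.em\n"
    "1\t100\t200\t2\t1\n"
    "1\t300\t400\t3\tNA\n"
)

# Default delimiter is a comma: the header becomes a single fused key.
comma_rows = list(csv.DictReader(io.StringIO(tsv_text)))
fused_keys = list(comma_rows[0].keys())

# Telling DictReader the input is tab-separated recovers the real columns.
tab_rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

print(fused_keys)
print(tab_rows[0]["tcn.em"], tab_rows[0]["lcn.em"])
```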
But, are you sure you want to use PhyloWGS for tumor phylogeny? I don't work in cancer anymore, but I suspect the field has developed something better since this...
There's a major problem (in my opinion) in that they model CNVs as single SNPs. They then make the assumption (true for SNPs but arguably very much false for CNVs) that a mutation can only happen once in the course of tumor evolution. So if, say, you have a duplication at a gene, and then as the tumor progresses that gene is amplified again, the software will break in unexpected ways.
Thanks for the quick reply, that worked. I agree with you, I was also thinking whether these assumptions hold true anymore. But yes, phylowgs and pyclone especially, are still being used for whole exome and whole genome sequencing data.
This might be a separate issue for the phylowgs github, but maybe you could advise me from your experience. I want to create the cnv_data.txt file required for phyloWGS. I only have txt files for my snvs+indels with all the required columns for phyloWGS (id, gene, number of reference-allele, total number of reads at variant allele). I do not have a vcf. I was trying to play around with the create_phylowgs_inputs.py script to start from the txt file, because I believe the script detects these columns from the vcf somewhere. I was thinking I could find that function within the script and only run the functions that follow it. I am still learning python, and I could not figure out which function it could be. The script is too long and has too many functions/variables for a beginner like me to go through one by one and understand. If I just write a script to get the snvs in each cnv region, and merge the samples (for multisample analysis) with semicolons to create cnv_data.txt, would it do the same thing? I might need to remove the cnvs that have NAs in the lcn column though.
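The "snvs in cnv region" step on its own could be sketched as below. This is only an illustration under stated assumptions: the keys `id`, `chrom`, `pos`, `start`, `end` and the function `snvs_in_regions` are hypothetical, coordinates are treated as inclusive, and none of this is taken from create_phylowgs_inputs.py. It does not write the cnv_data.txt format and does not merge samples.

```python
def snvs_in_regions(snvs, regions):
    """Map each region index to the ids of SNVs whose position lies in [start, end]."""
    hits = {i: [] for i in range(len(regions))}
    for snv in snvs:
        for i, reg in enumerate(regions):
            same_chrom = snv["chrom"] == reg["chrom"]
            if same_chrom and reg["start"] <= snv["pos"] <= reg["end"]:
                hits[i].append(snv["id"])
    return hits


# Made-up example data (hypothetical ids and coordinates).
regions = [
    {"chrom": "1", "start": 100, "end": 200},
    {"chrom": "2", "start": 50, "end": 80},
]
snvs = [
    {"id": "s0", "chrom": "1", "pos": 150},
    {"id": "s1", "chrom": "1", "pos": 250},  # outside every region
    {"id": "s2", "chrom": "2", "pos": 50},   # on a region boundary
]

hits = snvs_in_regions(snvs, regions)
print(hits)
```

The nested loop is fine for small inputs. For whole-genome data, sorting by position or using an interval index would be faster.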
I get this error when running the parser with a facets.tsv file. In my tsv file I have these columns:
Traceback (most recent call last):
  File "/Users/adhisadi/workspace/analysis/parse_cnvs_facets_extension.py", line 232, in
    main()
  File "/Users/adhisadi/workspace/analysis/parse_cnvs_facets_extension.py", line 228, in main
    regions = parser.parse()
              ^^^^^^^^^^^^^^
  File "/Users/adhisadi/workspace/analysis/parse_cnvs_facets_extension.py", line 100, in parse
    if (str.isdigit(record['tcn.em']) and str.isdigit(record['lcn.em'])):