A small bug in selectSolution.R

ysbioinfo commented 5 years ago

Hi Gavin,

I recently found a small bug in selectSolution.R. I run TitanCNA using snakemake and noticed that some patients were missing from the final optimalClusterSolution.txt while others appeared twice. For example, 07T was missing but 107T appeared twice.

I read the code and found that the bug is in this line:

    phi2Samples <- grep(id, phi2Files, value=T)

If my id is 07T, grep matches the files of both 07T and 107T and compares them together. In my case 107T wins the comparison under all conditions, so 07T disappeared from optimalClusterSolution.txt and 107T appeared twice. I changed this line to:

    phi2Samples <- grep(paste('/', id, sep = ''), phi2Files, value=T)

so that the id must appear at the beginning of the filename. This solves my problem. I'm not good at coding, and I think you may have a better way to fix this bug in the next version of TitanCNA. Thanks again for making such a convenient pipeline!
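For readers who want to see the matching problem in isolation, here is a minimal Python sketch of the same idea. It is not the TitanCNA code: the file names and the helper `files_for_sample` are hypothetical, and it assumes the sample id is the first part of the file name and is followed by a non-alphanumeric separator such as `_`. It is also slightly stricter than the `paste('/', id, sep = '')` workaround, because it checks the character after the id as well.

```python
import os
import re

def files_for_sample(sample_id, paths):
    """Return the paths whose file name starts with exactly sample_id.

    Hypothetical helper: the id must be the first thing in the file name
    and must not be followed by another letter or digit.
    """
    pattern = re.compile(r"^" + re.escape(sample_id) + r"(?![A-Za-z0-9])")
    return [p for p in paths if pattern.search(os.path.basename(p))]

# Made-up file names, for illustration only.
phi2_files = [
    "results/titan/07T_cluster1.params.txt",
    "results/titan/107T_cluster1.params.txt",
    "results/titan/07T_cluster2.params.txt",
]

# Unanchored matching, comparable to grep(id, phi2Files): "07T" is also
# a substring of "107T", so the 107T file is picked up as well.
naive = [p for p in phi2_files if "07T" in p]

# Anchored matching keeps only the files that belong to 07T.
anchored = files_for_sample("07T", phi2_files)
```

With these made-up names, `naive` contains all three paths, while `anchored` contains only the two 07T files.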

Yang

gavinha commented 5 years ago

Hi @snoopy-448

Thanks for bringing this up. This was raised previously by @lbeltrame in Issue #10. I haven't gotten around to fixing it yet, but I'll try to take a look at this soon. Glad to see you were able to make a quick fix.

Best, Gavin

ysbioinfo commented 5 years ago

Gavin, I have another question about the output of TitanCNA. I want to use the output from TitanCNA to run PhyloWGS. The PhyloWGS team wrote a cnv_parser.py to convert segs.txt to the format they need, but the parser seems to be designed for an older version of TitanCNA. Some column names in segs.txt have since changed, so their parser does not work with the latest version of Titan.

with open(self._titan_filename) as titanf:
  reader = csv.DictReader(titanf, delimiter='\t')
  for record in reader:
    chrom = record['Chromosome'].lower()
    cnv = {}
    cnv['start'] = int(record['Start_Position(bp)'])
    cnv['end'] = int(record['End_Position(bp)'])
    cnv['major_cn'] = int(record['MajorCN'])
    cnv['minor_cn'] = int(record['MinorCN'])

    clonal_freq = record['Clonal_Frequency']
    if clonal_freq == 'NA':
      cnv['cellular_prevalence'] = self._cellularity
    else:
      cnv['cellular_prevalence'] = float(clonal_freq) * self._cellularity

    cn_regions[chrom].append(cnv)

Above is an excerpt from their parser. Clearly, Start_Position(bp)/End_Position(bp) have been changed to Start_Position.bp./End_Position.bp. now. I wonder whether 'Clonal_Frequency' has been renamed to 'Cellular_Prevalence' in Titan. Are they the same? By the way, is Cellular_Prevalence the fraction of tumor cells harboring this CNV, so that I need to use Cellular_Prevalence * purity to get the fraction of all cells that harbor this CNV?

Thanks!

Yang

gavinha commented 5 years ago

Hi @snoopy-448

Yes, you are right. I changed Clonal_Frequency to Cellular_Prevalence at some point, and that might have broken their parser. Everything about this value is the same apart from the new column name.
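As a rough illustration of how a parser could accept both the old and the new column names, here is a minimal sketch. It is not the PhyloWGS code: `RENAMED`, `get_field`, and `parse_segs` are hypothetical names, the sample rows are made up, and the column list is limited to the renames mentioned in this thread. The prevalence handling simply mirrors the excerpt quoted above (value times cellularity, or cellularity alone for `NA`).

```python
import csv
import io

# Old column name -> new column name, as reported in this thread.
RENAMED = {
    "Start_Position(bp)": "Start_Position.bp.",
    "End_Position(bp)": "End_Position.bp.",
    "Clonal_Frequency": "Cellular_Prevalence",
}

def get_field(record, old_name):
    """Read a column by its old name, falling back to the new name."""
    if old_name in record:
        return record[old_name]
    return record[RENAMED[old_name]]

def parse_segs(handle, cellularity):
    """Parse a tab-delimited segs file into a list of CNV dicts."""
    cnvs = []
    for record in csv.DictReader(handle, delimiter="\t"):
        prevalence = get_field(record, "Clonal_Frequency")
        if prevalence == "NA":
            cellular_prevalence = cellularity
        else:
            cellular_prevalence = float(prevalence) * cellularity
        cnvs.append({
            "chrom": record["Chromosome"].lower(),
            "start": int(get_field(record, "Start_Position(bp)")),
            "end": int(get_field(record, "End_Position(bp)")),
            "major_cn": int(record["MajorCN"]),
            "minor_cn": int(record["MinorCN"]),
            "cellular_prevalence": cellular_prevalence,
        })
    return cnvs

# Made-up rows using the new column names, for illustration only.
sample_tsv = (
    "Chromosome\tStart_Position.bp.\tEnd_Position.bp.\t"
    "MajorCN\tMinorCN\tCellular_Prevalence\n"
    "1\t1000\t2000\t2\t1\t0.5\n"
    "X\t3000\t4000\t1\t1\tNA\n"
)
cnvs = parse_segs(io.StringIO(sample_tsv), cellularity=0.8)
```

Looking up the old name first keeps the sketch compatible with older segs.txt files, while the fallback covers the renamed columns.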

Sorry for the inconvenience!

-Gavin

ysbioinfo commented 5 years ago

Thanks so much!

MUppal commented 4 years ago

@snoopy-448, in addition to changing Clonal_Frequency to Cellular_Prevalence in the PhyloWGS parse_cnvs.py parser script, did you also change MajorCN and MinorCN to Corrected_MajorCN and Corrected_MinorCN on lines 69-70 of that script, so that it pulls from those columns in Titan's *segs.txt?

gavinha commented 4 years ago

Hi @MUppal

Corrected_MajorCN and Corrected_MinorCN are additional columns produced by a correction during post-processing that allows the copy number to be higher than the initial maximum (i.e. 8) in the model. They were added in commit 96e1c5bff8cf6f2793af40c3e463c85fe6fb3986 and discussed in #63.

The original MajorCN and MinorCN columns are still there, and it's up to you whether you'd like to use the corrected columns instead.
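One way to keep that choice explicit in a downstream parser is sketched below. The function `copy_number` and its `use_corrected` flag are hypothetical, the example record is made up, and the sketch assumes both pairs of columns hold integer values; it falls back to the original columns when the corrected ones are absent.

```python
def copy_number(record, use_corrected=False):
    """Return (major_cn, minor_cn) from one segs.txt row.

    Hypothetical helper: uses the Corrected_* columns only when asked
    to and when both of them are present in the row.
    """
    major_key, minor_key = "MajorCN", "MinorCN"
    if (use_corrected
            and "Corrected_MajorCN" in record
            and "Corrected_MinorCN" in record):
        major_key, minor_key = "Corrected_MajorCN", "Corrected_MinorCN"
    return int(record[major_key]), int(record[minor_key])

# Made-up row, for illustration only.
record = {
    "MajorCN": "8",
    "MinorCN": "0",
    "Corrected_MajorCN": "11",
    "Corrected_MinorCN": "0",
}
original = copy_number(record)
corrected = copy_number(record, use_corrected=True)
```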

Best, Gavin

gavinha / TitanCNA

A small bug in selectSolution.R #52