cbg-ethz / COMPASS

GNU General Public License v3.0

Questions about focal and broad CNVs #16

Open andreyurch opened 2 weeks ago

andreyurch commented 2 weeks ago

Dear developers,

Thank you very much for this advanced tool. I have several questions about modelling CNVs in Tapestri data (thousands of cells and 400 amplicons):

  1. I am interested in both broad events (whole chromosome or arm) and focal events (single gene). Can I provide a gene-level and an arm-level matrix at the same time? For example, if I have 4 genes on a chromosomal arm, can I make a matrix with read counts for TP53 and, at the same time, for chr17p, which covers all 4 genes including TP53?
  2. Is it possible to analyse only chromosomal events, without a mutation matrix?
  3. Can I detect focal amplifications with a very high copy number (e.g. 15x) for some genes?
  4. It is possible to run multiple chains in parallel. Will this reduce the computational time?
  5. By the way, do you have a preprocessing script that starts from h5 files instead of loom files?

e-sollier commented 2 weeks ago

Hi,

Thanks for your interest in COMPASS!

  1. No, you cannot call CNVs at the gene level and at the chromosome arm level simultaneously. You can try both methods one after the other, but I would recommend calling CNVs at the gene level.
  2. No, you cannot analyze only CNVs without SNVs with COMPASS. This is mainly because I found that SNVs were more reliably detected from Tapestri data than CNVs, so COMPASS will first infer a phylogeny of SNVs, and then add the CNVs.
  3. No, COMPASS does not estimate precise copy numbers for gains, mainly because such estimates would not be very precise. So if you have a copy number >10, COMPASS should just report this as a gain.
  4. It is possible to run multiple chains in parallel, and this does reduce computational time, but it requires COMPASS to be compiled with OpenMP.
  5. I only have preprocessing scripts for loom files. Did MissionBio change their file formats? I haven't worked with Tapestri data recently, but loom used to be their standard format.

andreyurch commented 2 weeks ago

Thank you very much for your answers!

  1. I think it would be a good addition for the future to implement high-level amplifications in the model. A gain of one or two additional copies of KRAS is not a biologically meaningful event, but a gain of more than 10 copies is a clinically very important alteration. In fact, we do not need to know the exact copy number for high-level amplifications; it could be categorised (e.g. 5+, 10+, 20+), which would be very useful.
  2. Yes, the main format is now h5: "H5 files are a replacement of loom files. These are part of the DNA and protein pipeline output." The MOSAIC pipeline is based entirely on h5: https://missionbio.github.io/mosaic/notebooks/overview.html

And one last question: if I preprocess my h5 files myself, how should I normalise the read count (CNV) table for genes? Should I take the sum or the mean of the amplicons? How does it work for whole chromosomes?

e-sollier commented 2 weeks ago

You can sum the read counts in each region (gene or chromosome) if you preprocess the data yourself.
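
As a rough illustration of that summing step, here is a minimal sketch in plain Python. The function name `sum_counts_by_region`, the amplicon names, and the counts are all made up for the example, and the in-memory layout (one list of per-cell counts per amplicon) is an assumption, not the format COMPASS reads. The same function works for genes or for chromosome arms; only the amplicon-to-region mapping changes.

```python
def sum_counts_by_region(amplicon_counts, amplicon_to_region):
    """Sum per-cell read counts over all amplicons assigned to each region.

    amplicon_counts: dict, amplicon name -> list of read counts (one per cell)
    amplicon_to_region: dict, amplicon name -> region label (gene or arm)
    Returns a dict, region label -> list of summed read counts (one per cell).
    """
    region_counts = {}
    for amplicon, counts in amplicon_counts.items():
        region = amplicon_to_region.get(amplicon)
        if region is None:
            continue  # amplicon is not assigned to any region: skip it
        if region not in region_counts:
            region_counts[region] = [0] * len(counts)
        if len(counts) != len(region_counts[region]):
            raise ValueError(f"{amplicon}: unexpected number of cells")
        for cell_index, count in enumerate(counts):
            region_counts[region][cell_index] += count
    return region_counts


# Toy data (hypothetical): 3 amplicons, 4 cells.
amplicon_counts = {
    "AMPL1": [10, 0, 5, 7],
    "AMPL2": [3, 2, 0, 1],
    "AMPL3": [8, 8, 8, 8],
}

# Gene-level regions.
gene_map = {"AMPL1": "TP53", "AMPL2": "TP53", "AMPL3": "KRAS"}
gene_counts = sum_counts_by_region(amplicon_counts, gene_map)

# Arm-level regions: same counts, different grouping.
arm_map = {"AMPL1": "chr17p", "AMPL2": "chr17p", "AMPL3": "chr12p"}
arm_counts = sum_counts_by_region(amplicon_counts, arm_map)
```

Writing the resulting table in the layout expected by COMPASS is not covered here; the example input files in the repository are the reference for that.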

andreyurch commented 2 weeks ago

Thank you. I found that the whitelist file for preprocessing contains a lot of information that I do not have for my dataset. Is it possible to use a whitelist with only chr, pos, ref, alt?

e-sollier commented 2 weeks ago

Yes, most of the columns in the example mutations.csv are not used by the preprocessing script. You only need: sample ID, chr, start, ref allele, alt allele (the names of the columns are important). The sample ID has to match the name of the loom file (e.g. if the loom file is sampleX.loom, then the sample ID should be sampleX).
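
To catch mismatches before running the preprocessing, a small sanity check along these lines may help. It is only a sketch: the helper `check_whitelist` is hypothetical, the header strings in `REQUIRED_COLUMNS` are taken from the wording of the reply above and should be copied exactly from the example mutations.csv, and the variant rows are placeholders rather than real calls.

```python
import csv
import io
from pathlib import Path

# Header names as written in the reply above. The exact spelling is an
# assumption: copy it from the example mutations.csv in the repository.
REQUIRED_COLUMNS = ["sample ID", "chr", "start", "ref allele", "alt allele"]


def check_whitelist(csv_text, loom_paths):
    """Return the sample IDs in the whitelist that match no loom file name."""
    # "data/sampleX.loom" -> "sampleX"
    loom_samples = {Path(p).stem for p in loom_paths}
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"whitelist is missing columns: {missing}")
    whitelist_samples = {row["sample ID"] for row in reader}
    return sorted(whitelist_samples - loom_samples)


# Placeholder whitelist content (coordinates and alleles are made up).
whitelist_text = (
    "sample ID,chr,start,ref allele,alt allele\n"
    "sampleX,chr17,1000,C,T\n"
    "sampleY,chr12,2000,G,A\n"
)

unmatched = check_whitelist(whitelist_text, ["data/sampleX.loom"])
```

Here `unmatched` lists the sample IDs that have no corresponding loom file, so an empty list means every whitelist row can be paired with a file.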