cbg-ethz / COMPASS

GNU General Public License v3.0
17 stars 9 forks source link

Multiple questions about the preprocess script #4

Closed reJELIN closed 1 year ago

reJELIN commented 1 year ago

Hi,

I'm trying to develop a single-cell DNA pipeline using tapestri missionbio so your tool is very interesting. As you may know missionbio releases a python package "mosaic". In this package they propose to filter_variants using those basis parameters (https://missionbio.github.io/mosaic/pages/methods/missionbio.mosaic.dna.Dna.filter_variants.html#missionbio.mosaic.dna.Dna.filter_variants) :

min_dp=10, min_gq=30, vaf_ref=5, vaf_hom=95, vaf_het=35, min_prct_cells=50, min_mut_prct_cells=1

from what i can see in your preprocess.py script you're not using those metric. instead you're using differents parameters to filter your variants :

Is there a way to fit the missionbio filtering method ? if i'm using the whitelist argument would it use only variants in the whitelist or would it use whitelist AND add variants that are passing filtering ?

by the way i'm trying to reproduce a mutation.csv from my samples i'm encountering this issue :

RuntimeWarning: invalid value encountered in long_scalars
  if count_loci_genotyped/len(prefiltered_loci)>= min_frac_loci_genotyped:

in my csv file i'm keeping only sample ID, chr, start, ref allele and alt allele is there a way to reproduce your mutation.csv from my own sample ?

Thanks you in advance for your answer and your time, your help would be very precious !

e-sollier commented 1 year ago

My preprocessing script is indeed inspired by the method used in mosaic, but an important difference is that my script attempts to select variants which are not present in all of the cells (so either somatic variants or germline variants which are lost following a deletion or copy-neutral loss of heterozygosity), because those are the variants that are informative for the phylogeny. In principle, COMPASS will still work if you include germline variants (they will be placed at the root of the tree), but it will be more efficient if we remove such uninformative variants. If this would be useful to you, I could write an other preprocessing script which follows exactly the mosaic method.

Concerning the parameters that you mentioned:

If you use a whitelist, the preprocessing script will only select variants in this whitelist.

I suspect that your error might be caused by len(prefiltered_loci) being 0. Could you try to add:

print(whitelist)
print(prefiltered_loci)

before the for loop at https://github.com/cbg-ethz/COMPASS/blob/master/Experiments/preprocessing/preprocess.py#L199 , so that I can try to understand where the problem is coming from ?

reJELIN commented 1 year ago

Hi E-sollier,

Thanks you a lot for your answers !

so i find a solution to my problem, apparently you had different samples in the missionbio data that you worked with, it was not my case so i just comment this line and it worked https://github.com/cbg-ethz/COMPASS/blob/017d694741c82208abe678428397795e46dd3b86/Experiments/preprocessing/preprocess.py#L87

e-sollier commented 1 year ago

OK, let me know if you have other issues, I'd be happy to help!

reJELIN commented 1 year ago

Thanks for your time !

It is not an issue but how and why sometime i have a node annotated "LOH gene1 ref alt" ? how does it happen ?

e-sollier commented 1 year ago

It is indeed not very clear, sorry about that! It means that there is a loss of heterozygosity on this gene. This gene contains 2 SNVs: for the first one, the reference allele is lost (so the cells are homozygous alternative); for the second SNV, the alternative allele is lost (so the cells are homozygous reference). COMPASS distinguishes between LOH caused by a copy number loss or copy-neutral loss of heterozygosity, but some amplicons are noisy and COMPASS cannot reliably tell whether a loss or a CNLOH happened, and in this case it only reports "LOH". I hope this clarifies, don't hestitate if you have further questions.

reJELIN commented 1 year ago

thanks is it indeed much clear for me !

I also observed that I have a difference of rows number between my sample.dna.barcodes() (i have 4557 different barcodes) and the output_cellAssignments.tsv (only 4547 rows), perhaps you could tell why ? it would really help if it could give us barcode assign to each cell.

Again thanks you a lot for your quick answer and your time !

e-sollier commented 1 year ago

COMPASS should output as many cells in _cellAssignments.tsv as were present in the input files.

I guess your 10 missing cells were filtered in the preprocessing. If you set min_frac_loci_genotyped to 0 in my preprocessing script, all of the cells will be kept (but then if you keep cells for which you don't have genotyping information for most loci, the assignments to the nodes will be unreliable for those cells).