Multiple questions about the preprocess script

reJELIN commented 1 year ago

Hi,

I'm trying to develop a single-cell DNA pipeline using tapestri missionbio so your tool is very interesting. As you may know missionbio releases a python package "mosaic". In this package they propose to filter_variants using those basis parameters (https://missionbio.github.io/mosaic/pages/methods/missionbio.mosaic.dna.Dna.filter_variants.html#missionbio.mosaic.dna.Dna.filter_variants) :

min_dp=10, min_gq=30, vaf_ref=5, vaf_hom=95, vaf_het=35, min_prct_cells=50, min_mut_prct_cells=1

from what i can see in your preprocess.py script you're not using those metric. instead you're using differents parameters to filter your variants :

min_frac_cells_genotyped from what i understand it is a threshold to keep loci that are at least genotyped in a minimum of cells corresponding more or like to the "min_prct_cells" of missionbio mosaic package
min_frac_loci_genotyped is the minimum percentage cells in which variants is present corresponding more or like to the "min_mut_prct_cells" of missionbio mosaic package
min_AF from i could read,feel free to correct if i misunderstand, you're assigning missing genotype for each variants in each barcodes/cells that is not passing the min_GQ and min_DP threshold. If so isn't it a bit stringent ? Missionbio with their package is approximately doing the same but is using different threshold for each genotype (ref,het,hom)
min_frac_cells_alt is a threshold where you're filtering cells were a loci genotyped as alt is too fewly express ?

Is there a way to fit the missionbio filtering method ? if i'm using the whitelist argument would it use only variants in the whitelist or would it use whitelist AND add variants that are passing filtering ?

by the way i'm trying to reproduce a mutation.csv from my samples i'm encountering this issue :

RuntimeWarning: invalid value encountered in long_scalars
  if count_loci_genotyped/len(prefiltered_loci)>= min_frac_loci_genotyped:

in my csv file i'm keeping only sample ID, chr, start, ref allele and alt allele is there a way to reproduce your mutation.csv from my own sample ?

Thanks you in advance for your answer and your time, your help would be very precious !

e-sollier commented 1 year ago

My preprocessing script is indeed inspired by the method used in mosaic, but an important difference is that my script attempts to select variants which are not present in all of the cells (so either somatic variants or germline variants which are lost following a deletion or copy-neutral loss of heterozygosity), because those are the variants that are informative for the phylogeny. In principle, COMPASS will still work if you include germline variants (they will be placed at the root of the tree), but it will be more efficient if we remove such uninformative variants. If this would be useful to you, I could write an other preprocessing script which follows exactly the mosaic method.

Concerning the parameters that you mentioned:

min_frac_cells_genotyped: corresponds indeed to min_prct_cells in mosaic
min_frac_loci_genotyped: this does not correspond to min_mut_prct_cells in mosaic. This parameter is used for filtering cells, not loci. The script only keeps cells for which more than this fraction of loci are genotyped.
min_AF: Cells with an allele frequency < min_AF for one locus are assigned a missing genotyped for this locus during the selection of loci. In the MissionBio data that I worked with, there were some loci which had a high sequencing error rate (all cells had ~5-10% of alt reads), so this filter attempts to remove them. This genotype is only used for the selection of loci; during the actual tree inference, COMPASS will use the read counts, not the called genotypes, so even if a cell is assigned a missing genotype here, COMPASS will still use the read information for this cell, if this locus is selected.
min_frac_cells_alt: this corresponds to min_mut_prct_cells in mosaic.

If you use a whitelist, the preprocessing script will only select variants in this whitelist.

I suspect that your error might be caused by len(prefiltered_loci) being 0. Could you try to add:

print(whitelist)
print(prefiltered_loci)

before the for loop at https://github.com/cbg-ethz/COMPASS/blob/master/Experiments/preprocessing/preprocess.py#L199 , so that I can try to understand where the problem is coming from ?

reJELIN commented 1 year ago

Hi E-sollier,

Thanks you a lot for your answers !

so i find a solution to my problem, apparently you had different samples in the missionbio data that you worked with, it was not my case so i just comment this line and it worked https://github.com/cbg-ethz/COMPASS/blob/017d694741c82208abe678428397795e46dd3b86/Experiments/preprocessing/preprocess.py#L87

e-sollier commented 1 year ago

OK, let me know if you have other issues, I'd be happy to help!

reJELIN commented 1 year ago

Thanks for your time !

It is not an issue but how and why sometime i have a node annotated "LOH gene1 ref alt" ? how does it happen ?

e-sollier commented 1 year ago

It is indeed not very clear, sorry about that! It means that there is a loss of heterozygosity on this gene. This gene contains 2 SNVs: for the first one, the reference allele is lost (so the cells are homozygous alternative); for the second SNV, the alternative allele is lost (so the cells are homozygous reference). COMPASS distinguishes between LOH caused by a copy number loss or copy-neutral loss of heterozygosity, but some amplicons are noisy and COMPASS cannot reliably tell whether a loss or a CNLOH happened, and in this case it only reports "LOH". I hope this clarifies, don't hestitate if you have further questions.

reJELIN commented 1 year ago

thanks is it indeed much clear for me !

I also observed that I have a difference of rows number between my sample.dna.barcodes() (i have 4557 different barcodes) and the output_cellAssignments.tsv (only 4547 rows), perhaps you could tell why ? it would really help if it could give us barcode assign to each cell.

Again thanks you a lot for your quick answer and your time !

e-sollier commented 1 year ago

COMPASS should output as many cells in _cellAssignments.tsv as were present in the input files.

I guess your 10 missing cells were filtered in the preprocessing. If you set min_frac_loci_genotyped to 0 in my preprocessing script, all of the cells will be kept (but then if you keep cells for which you don't have genotyping information for most loci, the assignments to the nodes will be unreliable for those cells).

cbg-ethz / COMPASS

Multiple questions about the preprocess script #4