fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing
Other
532 stars 89 forks source link

Query on multiple-sample somatic data #353

Open MartinezRuiz-Carlos opened 1 year ago

MartinezRuiz-Carlos commented 1 year ago

Hi,

I have been using Sniffles2 on ONT DNA data from tumour samples, so far so good. I had just a quick question about the multiple-sample mode and how appropriate it would be to use in our case. We have two tumour samples and a normal adjacent sample per patient, and we are of course interested in somatic SVs in the tumour samples. Here is a workflow I had in mind to make the most out of Sniffles2, but wanted to check if it is an appropriate use of the tool:

Does this all make sense or am I fundamentally misunderstanding how the multiple-sample mode works? Many thanks!

Carlos

fritzsedlazeck commented 1 year ago

Hi Carlos, What we are currently doing is to run all 3 samples with the --snf parameter. Then do the merging. We are also experimenting with the --genotype parameter for the cancer-only sample.

After that single sample run you can merge them using sniffles to provide the 3 snf files. This will give you a squared off VCF file. So no need to rerun or do forcecalling. The filtering can then happen over the resulting VCF file.

Hope that helps Fritz

MartinezRuiz-Carlos commented 1 year ago

Thank you Fritz, that makes a lot of sense. And just to be clear, the individual runs on each sample with --snf would also include the --non-germline parameter? Or should this parameter be used only for the tumour samples, but not the normal?

fritzsedlazeck commented 1 year ago

So the --non-germline is for mosaic , low variant frequency mutations. They could be of interested for cancer. We are currently running benchmarks to see if it is better to run this on the cancer samples but then default parameter on the normal samples..

MartinezRuiz-Carlos commented 1 year ago

Ok great, I will run it as such for now, with the flag on for tumour samples but not for the normals. Thank you!

MartinezRuiz-Carlos commented 1 year ago

This ran beautifully, just a very quick question. I see that for many called SVs, the genotypes are 0/0 in all samples. My understanding is that this means Sinffles calls this variant in one sample, but when aggregating it decides there isn't enough support in any of the samples. Is this correct? If so, how to interpret this? Should those variants simply be ignored? Many thanks again!

fritzsedlazeck commented 1 year ago

You mean after the merge right ? Yes that is the easiest way... although there is a bit of a catch. So if a variant is too low in read support compared to the overall number of reads at this location it also gets a 0/0. This would be the case is something is there but its not a 0/1 or 1/1 because of the read support. This usually doesn't happen often but just to be aware that there is a more suttle thing.

MartinezRuiz-Carlos commented 1 year ago

Yes, this is after the merge. The issue we are having is that this happens quite often, we lose about 2/3 of variants because they are 0/0 everywhere. Could this have to do with the fact that tumour samples were run with --non-germline? Does the merging expect higher support?

Below is the code I used to get the VCF

PATIENT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" tmp/samples.ids)

#Get all ONT IDs associated with the focal patient
ONT_SAMPLES=($(grep ${PATIENT} input/ds849_Project_Information_20220728.csv | cut -d ',' -f 1))

#Run sniffles per sample
for THIS_SAMPLE in ${ONT_SAMPLES[@]}
do

  #Get the bam of interest
  INPUT_BAM=$(ls input/${THIS_SAMPLE}_*bam)

  #Get the ID associated with the BAM
  REGION_ID=$(grep ${THIS_SAMPLE}, sample_info.csv)
  SAMPLE_NAME=$(echo ${PATIENT}"_"${REGION_ID})

  #Add --non-germline if the sample is tumour
  if [ $REGION_ID = "SU_N" ]; then
    GERMLINE_OPTS="--allow-overwrite"
  else
    GERMLINE_OPTS="--allow-overwrite --non-germline"
  fi

  sniffles --threads 16 \
           --reference input/hg38.fa \
           --tandem-repeats input/human_GRCh38_no_alt_analysis_set.trf.bed \
           ${GERMLINE_OPTS} \
           --input ${INPUT_BAM} \
           --snf tmp/${SAMPLE_NAME}.snf

done

#Run sniffles in multi_sample mode
#Get all SNF samples per patient
ALL_SNF=($(ls tmp/${PATIENT}*snf))
sniffles --input ${ALL_SNF[@]} --vcf results/${PATIENT}.vcf