cortes-ciriano-lab / SComatic

A tool for detecting somatic variants in single cell data
Other
173 stars 28 forks source link

Step4.2 fails to output any variants from scATACseq data #66

Open archieandrews10 opened 1 month ago

archieandrews10 commented 1 month ago

Hello: Step4.2 is failing to output any variants from scATACseq data. The bam file I am working on belongs to only one cell type. I was able to run the Steps 1 to 4.1 (Skipped step 3 since I was using only one celltype) using the parameters for scATACseq without any issues and the output of Step 4.1 produces a 8 Gb variant file containing a total of 56842458 variant lines (Output containing first 100 lines is attached). However, when the Step 4.2 was executed, I lost all variants and only the first few description lines were output (Output attached). The text file 'script.txt' lists all the commands and parameters used in the execution of these steps. When I audited the output of Step 4.1, I noticed that the values under ALT field were just '.' (see the output containing attached). Can you let me know what is causing this and how I can resolve this?

output_file_step_4_2.xlsx output_file_step_4_1.xlsx script.txt

ArthurDondi commented 1 month ago

Hi! First of all, with only one cell type you need to add the option --min_cell_types 1 to your BaseCellCalling.step1.py command, otherwise you will have no PASS (somatic) variants in the FILTER column of step 4.2 output, only variants tagged Min_cell_types. But that should not be your problem here as this would not cause an empty file in step 4.2.

Step 4.1 tracks all covered positions, not only variants, so your 56842458 lines are not variants but just positions with 5+ coverage. Therefore you should have hundreds to thousands of . ALT (reference) for one variant line, and the output_file_step_4_1.xlsx provided is expected to be only ".". You can check it by running: awk -F'\t' '{if ($6 != ".") {print $0}}' output.step4.1.tsv > only_alt_output.step4.1.tsv if only_alt_output.step4.1.tsv is empty I would be very surprised.

I think I identified your problem : during step 3, the POS column becomes Start and End columns (with same values), and skipping step 3 mess up steps 4.2. If step 3 doesnt work with one file, just try replacing the POS column by Start and End columns using Pandas in python for example. Then rerun steps 4.1 and 4.2, hopefully this should work.

archieandrews10 commented 1 month ago

Thanks. The modification worked on the cluster-level bams and I was able to call the variants in step 4.2. But since the variants were for the entire cell-type cluster. I went ahead and used the step 1 script again, with the cluster-level bam files as an input file and provided the bar-codes of each cell within the cluster to further split them into single-cells. The output directory had single-cell bams for each cell, indexed. However, I am receiving the below error. Any idea what is causing this and how to resolve this? Thank you for your help in this.

(SComatic) adam@lambda-scalar:/storage5/work/adam/scomatic/test_bams$ ./step2.sh
Outfile:  /storage5/work/adam/scomatic/test_bams/output_directory_Cancer_Cell_298_sorted/Sample.Cancer_Cell_298_sorted.tsv 

Directory  /storage5/work/adam/scomatic/test_bams/temp_directory_test_bams  created

No temporary files found
Computation time: 33 seconds

Below are some of the QC I have performed to debug:

The headers for both the cluster-level and single-cell BAM files contain similar information, including the @RG (read group) tag and the sequence names (@SQ). They both also have read group entries like @RG and identical sequence length information for chromosomes​.

There are no major discrepancies in the basic structure or content of the headers that would directly indicate why single-cell BAMs are causing an issue during processing.

The cluster-level BAM has significantly more reads (73,468,688) compared to the single-cell BAM (395,022)​(single_cell_stats). The size difference is expected since single-cell BAMs are subsets of the larger cluster-level BAM.

The proportion of properly paired reads and other paired-end statistics between the two BAMs are quite similar.

I re-sorted the single bams based on coordinates and re-indexed it, but, it did not work, same error. I set the --min_mq to 0 and tried again, it did not work. I get the same error. I set the --max_nM and --max_NH as None, it still did not work, I get the same error.

ArthurDondi commented 1 month ago

Running SComatic on single-cells makes little sense imo (I am not the developer, just a contributor). You want to run SComatic on all the cells from a sample/dataset and detect mutations exclusive to each cell type. Then, if you want to detect mutations in single cells, have a look at https://github.com/cortes-ciriano-lab/SComatic/blob/main/docs/OtherFunctionalities.md. Keep only PASS mutations after step 4.2 and use SingleCellGenotype.py.

archieandrews10 commented 1 month ago

I successfully called variants at the single-cell level based on your suggestion. I believe I have two options for calling variants using the parameters for scATAC-seq. The first option discussed in this thread has worked well. The second option is the default pipeline you recommended, which I haven't tested yet. Does the description and sequence of steps listed in second option seem appropriate?

First Option: Run the Steps 1, 2, Skip step 3 (since using only one celltype), replace the POS column by Start and End columns using Pandas, continue with steps 4.1 and 4.2, Keep only PASS mutations after step 4.2 and use SingleCellGenotype.py.

Second Option: Run the Steps 1, 2, 3, 4.1 and 4.2, Keep only PASS mutations after step 4.2 and use SingleCellGenotype.py.

ArthurDondi commented 1 month ago

Both should work, it's basically the same thing provided step 3 accepts a single file as input. Don't forget the --min_cell_types 1 option in step 4.1 if you only have one cell type.