HorvathLab / NGS

Next-Gen Sequencing tools from the Horvath Lab
https://horvathlab.github.io/NGS/
MIT License
39 stars 16 forks source link

empty results from scReadCounts #4

Closed rosaranli closed 2 years ago

rosaranli commented 2 years ago

Hello,

Thank you very much for such a nice tool.

When I tried to use the following command line to run the tool, I got three files with empty output. scReadCounts -r possorted_genome_bam.bam -s GRCh38.vcf_chr1.txt -o ${outdir}/GRCh38.vcf_chr1_scReadCounts.txt

possorted_genome_bam.bam is the 10x single cell data mapped using cell ranger. I tried different versions of vcf files, including the SNPs called from match exons, but non of them worked. I always got empty output, although sometimes it took a long time to run and everything looks ok for the process.

would it be possible to know what's the problem?

Thank you very much!

edwardsnj commented 2 years ago

Looking at your command-line, it looks like you are not setting the Read Group (cell-barcode) parameter correctly. The default is for UMI-tools, you probably want the STARsolo option to pick out the CB tags from the cell ranger BAM file.

See the documentation on Read Groups. Note that it expects a file called barcodes.tsv in the current working directory (this is output by STARsolo). Use -b None to omit this option.

Let me know if I've correct guessed the issue!

edwardsnj commented 2 years ago

I'm not as familiar with how Cell Ranger places the cell-barcodes into the BAM file as with STARsolo and UMI-tools. I thought it did this by adding a CB tag to each read's alignment, since it uses STARsolo internally (my undestanding). You can verify the CB tags on the aligned reads by using samtools view to look at the BAM file.

You can set the Read Grouping using the -G option (Read Groups in the user-interface).

Hope this helps...

rosaranli commented 2 years ago

Thank you very much for your help. It indeed made a difference when I specify Read Groups. I got the outputs now but two other questions came out:

Question 1: I did a test run on SNPs from chr3 only and I got 3 files for the output. file.txt, file.cnt.matrix.txt and file.vaf-m5.matrix.txt. I think I could guess what the matrix means from file.cnt.matrix.txt and file.vaf-m5.matrix.txt but I couldn't understand what each columns means for file.txt. It looks like the following:

1 14464 A T AACAAAGGTTCAGTAC-1 0 1 0 0 1 0 1 50.0 1.0 1 14464 A T AACCAACAGGGCAACT-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T AACGTCAAGGTAATCA-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T AAGCGAGTCATCAGTG-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T ACCACAAGTTAATCGC-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T ACTGATGTCTACCACC-1 0 0 0 1 0 1 1 0.0 0.0

It doesn't have headers so I couldn't guess what each column represents. I guess the last column is kind of fraction. I went to the website for the explanation of the output, but I can't get an one-to-one match to what was described in the website. Maybe I misunderstood I am just wondering if you could clarify what each column represents here.

Question 2: After I got the proper outputs from the test run, I ran the tool using the SNPs from all the chromosomes. It took a few days to run and generate the file.txt but an error happened it was like the following:

Read ReadCounts input files ->| **** (4:10 min) Output results

Traceback (most recent call last): File "scReadCounts.py", line 289, in File "execute.py", line 30, in execute AssertionError [24070] Failed to execute script scReadCounts

Do you know what might be the problem? I only got file.txt but not the other two files. I guess the file.cnt.matrix.txt and file.vaf-m5.matrix.txt can be got from the file.txt anyway so I ran the following command line:

readCountsMatrix -c "file.txt" -M Ref;Var -o "file.cnt.matrix.txt "

and it is still running.

Thank you very much for your help.

edwardsnj commented 2 years ago

Question 1: I did a test run on SNPs from chr3 only and I got 3 files for the output. file.txt, file.cnt.matrix.txt and file.vaf-m5.matrix.txt. I think I could guess what the matrix means from file.cnt.matrix.txt and file.vaf-m5.matrix.txt but I couldn't understand what each columns means for file.txt. It looks like the following:

1 14464 A T AACAAAGGTTCAGTAC-1 0 1 0 0 1 0 1 50.0 1.0 1 14464 A T AACCAACAGGGCAACT-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T AACGTCAAGGTAATCA-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T AAGCGAGTCATCAGTG-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T ACCACAAGTTAATCGC-1 0 0 0 1 0 1 1 0.0 0.0 1 14464 A T ACTGATGTCTACCACC-1 0 0 0 1 0 1 1 0.0 0.0

It doesn't have headers so I couldn't guess what each column represents. I guess the last column is kind of fraction. I went to the website for the explanation of the output, but I can't get an one-to-one match to what was described in the website. Maybe I misunderstood I am just wondering if you could clarify what each column represents here.

Hi Ran,

The issue here is that you chose the ".txt" output format which is designed for other (linux) tools to process (and has no headers to ease this task). If you choose tsv (or csv or xls, or xlsx) format, the output files will have headers.

The .txt (or .tsv, etc.) file is the output from the readCounts program (see documentation elsewhere on this site, including documentation of the headers), but is not in matrix form. There is one row for each locus and cell-barcode combination. Generating this output file is where 99% of the work of scReadCounts is. The scReadCounts program will only re-generate this file if it is missing (so if it is good, don't delete it).

Hope this helps...

edwardsnj commented 2 years ago

Question 2: After I got the proper outputs from the test run, I ran the tool using the SNPs from all the chromosomes. It took a few days to run and generate the file.txt but an error happened it was like the following:

Read ReadCounts input files ->| **** (4:10 min) Output results

Traceback (most recent call last): File "scReadCounts.py", line 289, in File "execute.py", line 30, in execute AssertionError [24070] Failed to execute script scReadCounts

The short answer is that this error message just tells me that readCounts failed, not why. Possibly out of memory? Out of disk-space? Write permission error? It appears that readCounts was attempting to write out its output file, which means it was almost done. You could try to run the readCounts program directly using the command-line shown in the scReadCounts output. Perhaps you will see more information there to determine the possible issue?

Do you know what might be the problem? I only got file.txt but not the other two files. I guess the file.cnt.matrix.txt and file.vaf-m5.matrix.txt can be got from the file.txt anyway so I ran the following command line:

readCountsMatrix -c "file.txt" -M Ref;Var -o "file.cnt.matrix.txt "

and it is still running.

readCountsMatrix should only take a few seconds to run. I don't know why it would take a long time, but given the error message above, perhaps file.txt is not complete (or empty) and this is causing problems.

I'm sorry you are having such difficulties!

edwardsnj commented 2 years ago

@rosaranli I'm wondering if you figured out why this wasn't working for you, or if you moved on and found other alternatives. I recently had to run on some bigger BAM files and discovered a similar issue as yourself - for me it was the memory footprint that was problematic. I have done some work to reduce the memory requirements of readCounts and readCountsMatrix, and this made a big difference for me. I will release this update shortly.

edwardsnj commented 2 years ago

@rosaranli Release 1.2.0 contains these fixes to reduce memory footprint, and should (I hope) make a difference in how effectively this works for you. Let me know if it helps!

rosaranli commented 2 years ago

Thank you very much and I am sorry for late reply. What I did was I just ran chromosome by chromosome so it would reduced the memory used. I'll definitely try the Release 1.2.0 to check if that makes a difference.