Piscem performance with bulk RNA datasets

JosephLalli commented 6 months ago

Hi there,

The alevin-fry/piscem paper understandably focuses on single-cell sequencing performance, and finds comparable results between piscem -> alevin-fry and salmon -> alevin-fry. Based on this result, it's reasonable to assume that piscem and salmon would also produce comparable results when aligning and quantifying bulk RNA-seq reads.

However, has the performance of the two tools been tested in a bulk RNA context? To be frank, I skimmed your alevin-fry/piscem paper, so I apologize if this comparison was done in the supplemental figures.

Best, Joe

rob-p commented 6 months ago

Hi @JosephLalli,

Thanks for reaching out. Indeed, that paper only discusses piscem in the context of the piscem -> alevin-fry pipeline, which is entirely focused on single-cell analysis. In the context of bulk sequencing, the pipeline would be piscem -> piscem-infer.

We do not have published benchmarking of that pipeline right now. However, we do have a manuscript in the works that describes piscem in more detail and which will include a description and some benchmarking of piscem-infer. That said, we have done internal testing, and find that the piscem -> piscem-infer pipeline generally does well in the bulk context, at least in the places where feature parity with salmon exists (salmon has more features than piscem-infer, some of which will be ported, and some of which likely won't). If there's any specific data you'd like to test on, we'd be happy to be involved in or help out with any benchmarking.

Best, Rob

JosephLalli commented 6 months ago

I have a dataset of 235 ancestrally diverse samples with paired bulk RNA and DNA short-read sequencing. I previously encountered issues with highly variable pseudogene expression leading to clearly erroneous eQTL hits.

To address this problem, and to increase QTL detection power, I am benchmarking the effect of different alignment methods on eQTL results. I have PanGenie SV calls in GRCh38 and T2T coordinates for all samples, as well as variant calls using the following combinations of tools:

- bwa-mem + HaplotypeCaller (GRCh38 or T2T reference)
- bwa-mem + DeepVariant (GRCh38 or T2T reference)
- vg giraffe + DeepVariant (GRCh38- or T2T-based reference graph)

I intend to phase the best-performing variant call set for each of GRCh38 and T2T, then use a Nextflow pipeline I have written to obtain RNA-seq counts by mapping to GRCh38, T2T, or a personalized transcriptome, generated from the phased variant calls, in each reference's coordinates. This requires two separate salmon index files for each individual, which quickly take up many terabytes of HDD space. The smaller index size of piscem is thus very interesting!
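To make the storage concern concrete, here is a rough back-of-the-envelope sketch in Python. Only the sample count (235) and the two-indexes-per-individual figure come from this thread; the function names and the per-index sizes are hypothetical placeholders, not measured salmon or piscem numbers, so substitute values observed on your own transcriptomes.

```python
# Rough estimate of the disk space needed for per-individual index files.
# From the thread: 235 samples, two personalized indexes per individual.
# Everything else (names, per-index sizes) is a hypothetical placeholder.

N_SAMPLES = 235
INDEXES_PER_SAMPLE = 2  # one per reference coordinate system (GRCh38, T2T)


def index_count(n_samples: int, indexes_per_sample: int = INDEXES_PER_SAMPLE) -> int:
    """Total number of index directories that must be stored."""
    return n_samples * indexes_per_sample


def footprint_tb(n_samples: int, gb_per_index: float,
                 indexes_per_sample: int = INDEXES_PER_SAMPLE) -> float:
    """Total footprint in terabytes (1 TB = 1000 GB) for a given per-index size."""
    return index_count(n_samples, indexes_per_sample) * gb_per_index / 1000.0


if __name__ == "__main__":
    # Placeholder per-index sizes in GB; replace with sizes measured on disk.
    for gb in (5.0, 20.0, 40.0):
        total = footprint_tb(N_SAMPLES, gb)
        print(f"{gb:5.1f} GB per index -> {index_count(N_SAMPLES)} indexes, {total:.2f} TB")
```

Because the total scales linearly with the per-index size, any reduction in index size carries through unchanged to the overall footprint across all 470 indexes.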

I don't know that I have the time to do proper benchmarking between piscem and salmon (hence my hoping that you had already done it :) but if you have any interest in this dataset I'd be happy to collaborate. My email address is lalli@wisc.edu.

COMBINE-lab / piscem

Piscem performance with bulk RNA datasets #20