Mayrlab / scUTRquant

Bioinformatics pipeline for single-cell 3' UTR isoform quantification
https://Mayrlab.github.io/scUTRquant
GNU General Public License v3.0

Custom 3'utr #72

Open aleighbrown opened 3 months ago

aleighbrown commented 3 months ago

Hello,

I have a set of custom novel 3'UTRs that I would like to quantify in single-cell data.

Ideally, I would just want to quantify the 100 or so 3'UTRs that I'm interested in, for speed's sake.

What would I need in order to build a minimal working set of inputs (kallisto index of the UTRome, GTF, and TSV merge annotation) for my custom set of 3'UTRs?

Thank you in advance!

mfansler commented 3 months ago

Thanks for the interest!

Yes, you could build a custom target that only quantifies reads in the regions of interest. To plug into this pipeline, you would indeed provide a kallisto index ("kdx"), GTF, and TSV merge annotation. You would edit the extdata/targets/targets.yaml to add this information, something like:

custom_utrs:
  path: "extdata/targets/custom_utrs/"
  genome: "hg38"
  gtf: "custom_utrs.gtf"
  kdx: "custom_utrs.kdx"
  merge_tsv: "custom_utrs.merge.tsv"
  tx_annots: null
  gene_annots: null
  download_script: null

and the path would be relative to the root of the repository (absolute is also fine).

Caveats

I'll just note some caveats about taking this approach as opposed to adding the custom 3'UTRs to the full annotation.

Identifying Cells: Valid cell barcodes would need to come from previous data processing. Otherwise, the targeted regions alone may not be sufficient to discriminate high-quality cells from low-quality cells or background.

Comparing Across Cells or Samples: Normalization (size) factors would need to come from previous data processing. With only targeted regions, it would be unclear whether higher counts were due to higher expression, higher capture rate, deeper sequencing, or some mixture.
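To make the two points above concrete, here is a minimal Python sketch of reusing barcodes and size factors from an earlier full-annotation run. Everything in it is an assumption for illustration: the function name `normalize_targeted`, the nested-dict layout of the counts, and the idea that the targeted counts have already been exported into plain Python structures. It is not part of scUTRquant.

```python
from typing import Dict, Set


def normalize_targeted(
    counts: Dict[str, Dict[str, float]],
    valid_barcodes: Set[str],
    size_factors: Dict[str, float],
) -> Dict[str, Dict[str, float]]:
    """Keep only previously validated cells and scale their targeted counts.

    counts:         {cell barcode: {3'UTR id: raw count}} from the targeted run
    valid_barcodes: barcodes called as high-quality cells in earlier processing
    size_factors:   {cell barcode: size factor} from earlier processing
    """
    normalized: Dict[str, Dict[str, float]] = {}
    for barcode, per_utr in counts.items():
        if barcode not in valid_barcodes:
            continue  # background or low-quality barcode: drop it
        factor = size_factors.get(barcode)
        if factor is None or factor <= 0:
            continue  # no usable size factor: cannot be compared across cells
        normalized[barcode] = {utr: n / factor for utr, n in per_utr.items()}
    return normalized
```

Cells without a usable size factor are dropped silently here; raising an error instead may be the safer choice if every valid barcode is expected to have one.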

Multimapping Reads: Reads that would multimap in a full annotation might map uniquely in a targeted subset, leading to overestimated counts. One should show that this isn't a factor before trusting the targeted results. You'd probably want to prepare a full index (full UTRome + custom novel 3'UTRs) and then inspect whether any of the k-mers from the targeted regions are shared with those in non-targeted regions. If any are, you may need to include the other transcripts that share k-mers to make the assignment fair. That is, one doesn't want changes in the expression of some other gene to show up as isoform-specific expression in a targeted isoform because the alternative loci from which the reads may have originated were excluded.
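One rough way to inspect k-mer sharing without going through the index itself is to compare the transcript sequences directly. The sketch below is an illustration only, not something the pipeline provides: the helper names are made up, `k = 31` is kallisto's default k-mer length (change it if the index is built with a different `-k`), and treating both strands as equivalent is a conservative assumption of this sketch.

```python
from typing import Dict, Set

K = 31  # kallisto's default k-mer length
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int = K) -> Set[str]:
    """All k-mers of seq, each represented by the smaller of itself and its reverse complement."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            kmers.add(min(kmer, revcomp(kmer)))
    return kmers


def shared_kmer_report(
    targeted: Dict[str, str],
    background: Dict[str, str],
    k: int = K,
) -> Dict[str, Set[str]]:
    """For each targeted transcript, list background transcripts sharing at least one k-mer.

    targeted:   {transcript id: sequence} for the custom 3'UTRs
    background: {transcript id: sequence} for all other transcripts in the full annotation
    """
    kmer_to_targets: Dict[str, Set[str]] = {}
    for name, seq in targeted.items():
        for kmer in canonical_kmers(seq, k):
            kmer_to_targets.setdefault(kmer, set()).add(name)

    hits: Dict[str, Set[str]] = {name: set() for name in targeted}
    for bg_name, seq in background.items():
        for kmer in canonical_kmers(seq, k):
            for target in kmer_to_targets.get(kmer, ()):
                hits[target].add(bg_name)
    return hits
```

Any targeted transcript with a non-empty set in the result is a candidate for having its partners added to the targeted index. The background dictionary should hold only non-targeted transcripts, that is, the full set minus the custom 3'UTRs.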

On the last point, you may also just do some empirical spot checks. For example, run some samples with the full UTRome + novel 3'UTRs and separately with just the targeted index, then compare the counts for the targeted 3'UTRs. Counts that differ between the two runs would point to multimapping issues.
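A spot check like this could be scripted along the following lines. This is a hedged sketch: it assumes the per-3'UTR counts from both runs have already been summed over cells and exported into plain dictionaries, and both the function name `flag_discrepancies` and the 5% tolerance are arbitrary choices for illustration.

```python
from typing import Dict, Tuple


def flag_discrepancies(
    full_counts: Dict[str, float],
    targeted_counts: Dict[str, float],
    rel_tol: float = 0.05,
) -> Dict[str, Tuple[float, float, float]]:
    """Return targeted 3'UTRs whose counts differ between the two runs.

    full_counts:     {3'UTR id: total count} from the full UTRome + novel 3'UTRs run
    targeted_counts: {3'UTR id: total count} from the targeted-only run
    rel_tol:         relative difference above which a 3'UTR is flagged

    Each flagged entry maps to (full count, targeted count, relative difference),
    where a positive difference means the targeted run reported more reads.
    """
    flagged: Dict[str, Tuple[float, float, float]] = {}
    for utr, targeted in targeted_counts.items():
        full = full_counts.get(utr, 0.0)
        largest = max(full, targeted)
        if largest == 0:
            continue  # not detected in either run
        rel_diff = (targeted - full) / largest
        if abs(rel_diff) > rel_tol:
            flagged[utr] = (full, targeted, rel_diff)
    return flagged
```

A positive relative difference is the pattern one would expect if reads that multimap in the full index are being assigned uniquely in the targeted one.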

Hope that's helpful! Let me know if I can answer any more questions.