Example script to process the output files

zhanglab2008 commented 2 years ago

Hi Mervin, I have succeeded in getting the output files from scUTRquant but I have no idea if the files are in good shape. For example, when I read the _heart_10k_v3fastq.genes.Rds in R, the gene names are like this:

   [1] "ENSMUSG00000000001.4"  "ENSMUSG00000000003.15" "ENSMUSG00000000028.15" "ENSMUSG00000048583.16"
   [5] "ENSMUSG00000000049.11" "ENSMUSG00000000058.6"  "ENSMUSG00000000078.7"  "ENSMUSG00000000085.16"
   [9] "ENSMUSG00000000088.7"  "ENSMUSG00000061689.15" "ENSMUSG00000000093.6"  "ENSMUSG00000000094.12"

and here are the transcript names from the _heart_10k_v3fastq.txs.Rds:

  [1] "ENSMUST00000000001.4"  "ENSMUST00000000003.13" "ENSMUST00000000028.13" "ENSMUST00000000033.11"
   [5] "ENSMUST00000000049.5"  "ENSMUST00000000058.6"  "ENSMUST00000000080.7"  "ENSMUST00000000087.12"
   [9] "ENSMUST00000000090.7"  "ENSMUST00000000094.13" "ENSMUST00000000095.6"  "ENSMUST00000000096.11"

Have you by any changes made a tutorial about how to process the output files? In particular, I want to know how to get the 3'UTR information from the output files. I noticed that you have made another repo called "scUTRquant-figures" to reproduce the figures in your biorxiv paper but I have difficulty in reading the codes since the codes are not starting from the scUTRquant output files. It would be really helpful if you can make a tutorial for that. Many thanks for developing this tool again! Zhang

mfansler commented 2 years ago

@zhanglab2008 Great to hear it finished! Those sound like correct rownames for resulting files. I appreciate all the feedback! 🙏

As for tutorial, yes, eventually this will be added as a vignette to the scUTRboot R package (which is more a work in progress). In the meantime, I'd direct you to the processing we did in Figure 5, for the 451Lu dataset where we used a GENCODE (human) target. It should be straightforward swapping in your data files.

The preprocessing and testing is performed in fig5d-451lu-pairwise-tests.Rmd. And the plotting is in fig5d-451lu-pairwise-plots.Rds.

Adjust the QC filtering to your needs, e.g., MT percent thresholds depend on cell type. The entropy/diversity filtering is optional. In our examples, we never call cell types ourselves, but just use the cell types inferred from gene-level methods. Did you provide cell_annots to include any cluster/cell type information about the cells?

Note that we don't provide functions yet for annotating custom targets with things like shortest and longest, or distinguish between intronic and tandem 3' UTRs (maybe we should include that in this pipeline for when users don't provide their own tx or gene annots?). However, the example I mentioned does provide a rudimentary annotation of first and last UTR using the genomic position of the transcripts' 3'-ends. In that file, we do a LUI test, but because we didn't distinguish intronic sites, many of the significant hits are actually intronic polyadenylation events. Not sure if that matters for you.

Also, I'd suggested running the WD test mode, since that will test any changes. However, it doesn't give any directional information - only total magnitude of isoform changes.

As for the gene results, one can process them like any other scRNA-seq dataset.

Requirements The knitted version also has a Conda environment YAML output that you could use to recreate the environment on MacOS.

zhanglab2008 commented 2 years ago

Hi Mervin, Excellent! I will try the code to generate your figure 5. Yes, I included cell_annots when I ran scUTRquant. I actually ran it using a barcode whitelist with the cells that have been passed the QC from Seurat (based on the cellranger counts). You made a really good point about the intronic polyadenylation events. I am not sure how this may affect my data but I think I need to get the 3'UTR data out first and then see if the results make sense or not. I will get back to you if any issues pop up. Really great work!

Mayrlab / scUTRquant

Example script to process the output files #34