Mayrlab / scUTRquant

Bioinformatics pipeline for single-cell 3' UTR isoform quantification
https://Mayrlab.github.io/scUTRquant
GNU General Public License v3.0

improve default row annotations #46

Closed mfansler closed 1 year ago

mfansler commented 1 year ago

The row annotations on the transcript-level SCE objects do not include the gene ID. The gene-level SCE does have rowRanges, but, apparently because its rowRanges is a GRangesList, the metadata columns do not appear in rowData at all (rowData has 0 columns), which is problematic.
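
A minimal sketch of the suspected cause, using toy ranges rather than the pipeline's real objects (the IDs `tx1`, `tx1-UTR-884`, and `gene1` are made up for illustration): in a GRangesList, the `transcript_id` and `gene_id` columns live on the inner ranges, while the list-level metadata columns, which are what rowData exposes, start out empty.

```r
library(GenomicRanges)

# Toy transcript-level ranges with hypothetical IDs (not taken from the UTRome).
gr <- GRanges(
  seqnames = "chr3",
  ranges = IRanges(start = c(108107280, 108108164), width = 500),
  strand = "-",
  transcript_id = c("tx1", "tx1-UTR-884"),
  gene_id = rep("gene1", 2)
)
names(gr) <- gr$transcript_id

# Splitting on gene ID gives a GRangesList, as in the gene-level SCE.
grl <- split(gr, gr$gene_id)

# List-level metadata (one row per gene): expected to carry no columns.
list_mcols <- mcols(grl)

# Inner metadata (one row per transcript): still carries both ID columns.
inner_cols <- colnames(mcols(unlist(grl, use.names = FALSE)))
print(inner_cols)
```

If this matches what the pipeline does, the gene-level annotations would have to be set explicitly on the list-level metadata columns for them to show up in rowData.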

This should be improved to include more information from the GTF file. The transcript-level and gene-level objects should also be consistent in what is stored as rowRanges. E.g., if gene-level uses transcripts split on gene IDs, then transcript-level should use exons split on transcript IDs.
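
One way the consistent layout could look, sketched on toy data: exons split on transcript ID for the transcript level, transcripts split on gene ID for the gene level, with the IDs also copied to the list-level metadata columns. The exon coordinates and the names `txA`, `txB`, and `geneX` are hypothetical, and taking the full exon span as the transcript range is an assumption made for illustration; it is not how scUTRquant defines its ranges.

```r
library(GenomicRanges)

# Toy exons with hypothetical IDs; a real run would take these from the GTF.
exons <- GRanges(
  seqnames = "chr3",
  ranges = IRanges(start = c(100, 300, 100, 500), width = 50),
  strand = "-",
  transcript_id = c("txA", "txA", "txB", "txB"),
  gene_id = rep("geneX", 4)
)

# Transcript level: exons split on transcript ID.
exons_by_tx <- split(exons, exons$transcript_id)
idx <- match(names(exons_by_tx), exons$transcript_id)
# Set list-level metadata so that it is visible per row, not only per exon.
mcols(exons_by_tx) <- DataFrame(
  transcript_id = names(exons_by_tx),
  gene_id = exons$gene_id[idx]
)

# Gene level: one range per transcript (here, the span of its exons),
# then split on gene ID.
txs <- unlist(range(exons_by_tx))
mcols(txs) <- mcols(exons_by_tx)
txs_by_gene <- split(txs, txs$gene_id)
mcols(txs_by_gene) <- DataFrame(gene_id = names(txs_by_gene))

print(mcols(exons_by_tx))
print(mcols(txs_by_gene))
```

Because rowData on a ranged SummarizedExperiment is derived from the metadata columns of its rowRanges, objects built this way should expose `transcript_id` and `gene_id` as rowData at both levels.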

Proposed additions (transcript-level):

Both:


Current Outputs

Transcript-Level SCE

> sce <- readRDS("data/sce/utrome_mm10_v2/heart_1k_v2_fastq.txs.Rds")
> rowData(sce)
DataFrame with 44467 rows and 1 column
                                       transcript_id
                                         <character>
ENSMUST00000000001.4            ENSMUST00000000001.4
ENSMUST00000000001.4-UTR-1494 ENSMUST00000000001.4..
ENSMUST00000000001.4-UTR-884  ENSMUST00000000001.4..
ENSMUST00000000003.13          ENSMUST00000000003.13
ENSMUST00000000010.8            ENSMUST00000000010.8
...                                              ...
ENSMUST00000239485.1            ENSMUST00000239485.1
ENSMUST00000239489.1            ENSMUST00000239489.1
ENSMUST00000239492.1            ENSMUST00000239492.1
ENSMUST00000239495.1            ENSMUST00000239495.1
ENSMUST00000239498.1            ENSMUST00000239498.1
> rowRanges(sce)
GRanges object with 44467 ranges and 1 metadata column:
                                seqnames              ranges strand |          transcript_id
                                   <Rle>           <IRanges>  <Rle> |            <character>
           ENSMUST00000000001.4     chr3 108107280-108107779      - |   ENSMUST00000000001.4
  ENSMUST00000000001.4-UTR-1494     chr3 108108774-108109273      - | ENSMUST00000000001.4..
   ENSMUST00000000001.4-UTR-884     chr3 108108164-108108663      - | ENSMUST00000000001.4..
          ENSMUST00000000003.13     chrX   77837901-77845039      - |  ENSMUST00000000003.13
           ENSMUST00000000010.8    chr11   96276096-96276595      + |   ENSMUST00000000010.8
                            ...      ...                 ...    ... .                    ...
           ENSMUST00000239485.1     chr5   64953106-64953605      - |   ENSMUST00000239485.1
           ENSMUST00000239489.1    chr15   85900412-85901109      - |   ENSMUST00000239489.1
           ENSMUST00000239492.1     chr8   69373528-69373914      - |   ENSMUST00000239492.1
           ENSMUST00000239495.1     chr7 144421092-144421591      + |   ENSMUST00000239495.1
           ENSMUST00000239498.1     chr2   25766115-25768099      + |   ENSMUST00000239498.1
  -------
  seqinfo: 239 sequences from mm10 genome

Gene-Level SCE

> sce <- readRDS("data/sce/utrome_mm10_v2/heart_1k_v2_fastq.genes.Rds")
> rowData(sce)
DataFrame with 21658 rows and 0 columns
> rowRanges(sce)
GRangesList object of length 21658:
$ENSMUSG00000000001.4
GRanges object with 4 ranges and 2 metadata columns:
                                seqnames              ranges strand |          transcript_id              gene_id
                                   <Rle>           <IRanges>  <Rle> |            <character>          <character>
           ENSMUST00000000001.4     chr3 108107280-108107779      - |   ENSMUST00000000001.4 ENSMUSG00000000001.4
   ENSMUST00000000001.4-UTR-125     chr3 108107405-108107904      - | ENSMUST00000000001.4.. ENSMUSG00000000001.4
   ENSMUST00000000001.4-UTR-884     chr3 108108164-108108663      - | ENSMUST00000000001.4.. ENSMUSG00000000001.4
  ENSMUST00000000001.4-UTR-1494     chr3 108108774-108109273      - | ENSMUST00000000001.4.. ENSMUSG00000000001.4
  -------
  seqinfo: 239 sequences from mm10 genome

$ENSMUSG00000000003.15
GRanges object with 2 ranges and 2 metadata columns:
                        seqnames            ranges strand |         transcript_id               gene_id
                           <Rle>         <IRanges>  <Rle> |           <character>           <character>
  ENSMUST00000000003.13     chrX 77837901-77845039      - | ENSMUST00000000003.13 ENSMUSG00000000003.15
   ENSMUST00000114041.2     chrX 77837902-77848039      - |  ENSMUST00000114041.2 ENSMUSG00000000003.15
  -------
  seqinfo: 239 sequences from mm10 genome

$ENSMUSG00000020875.9
GRanges object with 2 ranges and 2 metadata columns:
                                seqnames            ranges strand |          transcript_id              gene_id
                                   <Rle>         <IRanges>  <Rle> |            <character>          <character>
           ENSMUST00000000010.8    chr11 96276096-96276595      + |   ENSMUST00000000010.8 ENSMUSG00000020875.9
  ENSMUST00000000010.8-UTR+3573    chr11 96279669-96280168      + | ENSMUST00000000010.8.. ENSMUSG00000020875.9
  -------
  seqinfo: 239 sequences from mm10 genome

...
<21655 more elements>