LTLA / scRNAseq

Clone of the Bioconductor repository for the scRNAseq package.
http://bioconductor.org/packages/devel/data/experiment/html/scRNAseq.html

symbol column dropped from rowData of SegerstolpePancreasData (devel) #45

Open lgeistlinger opened 8 months ago

lgeistlinger commented 8 months ago

In Bioc release:

> library(scRNAseq)
> sce.seger <- SegerstolpePancreasData()
> rowData(sce.seger)
DataFrame with 26179 rows and 2 columns
                symbol                 refseq
           <character>            <character>
SGIP1            SGIP1              NM_032291
AZIN2            AZIN2 NM_052998+NM_001293562
CLIC4            CLIC4              NM_013943
AGBL4            AGBL4              NM_032785
NECAP2          NECAP2 NM_001145277+NM_0011..
...                ...                    ...
KIR2DL4        KIR2DL4 NM_001080772+NM_0022..
KIR2DS3        KIR2DS3              NM_012313
KIR2DS2        KIR2DS2 NM_001291696+NM_0123..
BIVM-ERCC5  BIVM-ERCC5           NM_001204425
eGFP              eGFP                   eGFP

In Bioc devel:

> library(scRNAseq)
> sce.seger <- SegerstolpePancreasData()
> rowData(sce.seger)
DataFrame with 26179 rows and 1 column
                           refseq
                      <character>
SGIP1                   NM_032291
AZIN2      NM_052998+NM_001293562
CLIC4                   NM_013943
AGBL4                   NM_032785
NECAP2     NM_001145277+NM_0011..
...                           ...
KIR2DL4    NM_001080772+NM_0022..
KIR2DS3                 NM_012313
KIR2DS2    NM_001291696+NM_0123..
BIVM-ERCC5           NM_001204425
eGFP                         eGFP

I think this causes OSCA.advanced and OSCA.workflows to break in devel @PeteHaitch @alanocallaghan

LTLA commented 8 months ago

Hm. I think I must have deemed the row names to be redundant with the symbol column and removed the latter to reduce the file size. To avoid breaking stuff, I can dynamically add it back in for the SegerstolpePancreasData function; however, fetchDataset() will still return the sans-symbol version, so people loading the dataset directly from the files (i.e., not through the per-dataset getters) will get a slightly different version of the dataset.

FYI fetchDataset() is going to be the way forward as it (i) avoids the need for contributors to write a getter function and (ii) eliminates the involvement of dataset-specific logic that can't be easily replicated in other frameworks like Python or JS.

Is Segerstolpe the only one? FWIW you can set legacy=TRUE and it'll pull from ExperimentHub for now.

lgeistlinger commented 8 months ago

If that's the way forward, we can also adapt the corresponding parts of the OSCA book to look up the symbols from the row names. I can't tell at this point whether this also affects other datasets. The breakage comes from looking up the symbol column for ID mapping purposes, so supplying the row names instead should fix it.
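A minimal sketch of what such a lookup could look like, assuming downstream code only needs a character vector of gene symbols. The helper name `getSymbols` is hypothetical (it is not part of scRNAseq or the OSCA book), and the toy object below only mimics the devel layout shown above instead of downloading the real dataset:

```r
library(SingleCellExperiment)

# Hypothetical helper, not part of scRNAseq: return gene symbols for an SCE,
# preferring a 'symbol' column in rowData() and otherwise falling back to
# the row names.
getSymbols <- function(sce) {
    rd <- rowData(sce)
    if ("symbol" %in% colnames(rd)) {
        as.character(rd$symbol)
    } else {
        rownames(sce)
    }
}

# Toy object mimicking the devel layout: only 'refseq' in rowData(),
# with the gene symbols stored as row names.
counts <- matrix(0L, nrow = 3, ncol = 2,
    dimnames = list(c("SGIP1", "AZIN2", "CLIC4"), c("cell1", "cell2")))
sce <- SingleCellExperiment(assays = list(counts = counts))
rowData(sce)$refseq <- c("NM_032291", "NM_052998+NM_001293562", "NM_013943")

# No 'symbol' column here, so the row names are returned.
symbols <- getSymbols(sce)
```

Written this way, the same call should work on both the release-style object (with a symbol column) and the fetchDataset()-style object (without one), so ID mapping code would not need to know which version it was handed.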

LTLA commented 8 months ago

Added back symbol in 2.19.4, but only for SegerstolpePancreasData, so fetchDataset() will still be missing symbol.

alanocallaghan commented 8 months ago

Yeah seems sensible to just use the rownames for OSCA purposes moving forward

alanocallaghan commented 4 months ago

Think this is resolved now?