CharlesJB / ENCODExplorer

error in metadata table #53

Closed vjcitn closed 4 years ago

vjcitn commented 4 years ago

We'll show a confusing situation with organism and assembly for elements of the full 2019-10-13 metadata build.

library(AnnotationHub)
ah = AnnotationHub()
query(ah, "ENCODExplorerData")
## AnnotationHub with 4 records
## # snapshotDate(): 2020-02-28
## # $dataprovider: ENCODE Project
## # $species: NA
## # $rdataclass: data.table
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["AH69290"]]' 
## 
##             title                                         
##   AH69290 | ENCODE File Metadata (Light, 2019-04-12 build)
##   AH69291 | ENCODE File Metadata (Full, 2019-04-12 build) 
##   AH75131 | ENCODE File Metadata (Light, 2019-10-13 build)
##   AH75132 | ENCODE File Metadata (Full, 2019-10-13 build)
fm = ah[["AH75132"]]

This is a data.table. We tabulate the assemblies in use for experiments of organism Homo sapiens.

> table(fm$assembly[fm$organism=="Homo sapiens"], fm$organism[fm$organism=="Homo sapiens"])

                 Homo sapiens
  GRCh38               129473
  GRCh38-minimal            4
  hg19                 162657
  mm10                    133
  mm10-minimal            270
  mm9                       2

Probably the organism is simply mislabeled, and the assembly annotation is more reliable. But is this an error upstream in ENCODE metadata or is it a curation problem in this package? Thank you.
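For anyone who wants to list the affected rows rather than just count them, here is a minimal sketch. It runs on a made-up toy table, not the real metadata. It assumes the `organism` and `assembly` columns are plain character vectors as in the tabulation above, that mouse records are labeled "Mus musculus", and that the assembly prefix (GRCh/hg for human, mm for mouse) identifies the organism. The helper `organism_from_assembly` is hypothetical and not part of ENCODExplorer.

```r
# Toy stand-in for the real metadata table; only the two columns used in
# this thread (organism, assembly) are reproduced, with made-up rows.
fm_toy <- data.frame(
  organism = c("Homo sapiens", "Homo sapiens", "Homo sapiens", "Mus musculus"),
  assembly = c("GRCh38", "hg19", "mm10", "mm10-minimal"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: infer the organism from the assembly name prefix.
# Only the assembly families seen in the table above are covered;
# anything else is left as NA instead of being guessed.
organism_from_assembly <- function(assembly) {
  out <- rep(NA_character_, length(assembly))
  out[grepl("^(GRCh|hg)", assembly)] <- "Homo sapiens"
  out[grepl("^mm", assembly)] <- "Mus musculus"
  out
}

inferred <- organism_from_assembly(fm_toy$assembly)

# A row counts as a mismatch only when an organism could be inferred
# from the assembly and it differs from the recorded organism.
mismatch <- !is.na(inferred) & !is.na(fm_toy$organism) &
  inferred != fm_toy$organism

# Show the suspicious rows.
fm_toy[mismatch, ]
```

The same logical index should work on the real table (e.g. `fm[mismatch, ]` after computing `inferred` from `fm$assembly`), but that has not been checked against the actual build.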

ericfournier2 commented 4 years ago

Thank you for reporting this. It seems the organism is mislabeled in ENCODExplorer in at least some of these cases, since the ENCODE website reports the organism correctly. I'll look into the issue in more depth later on, but for now you should probably rely on the genome build.

ericfournier2 commented 4 years ago

I have generated a new build of the metadata, and it contains no assembly/organism mismatches. I haven't determined for sure whether the issue stemmed from incorrect metadata on ENCODE's end, but since there have been no real code changes in the past six months, that seems the most likely cause.

This should go away when we deploy the new build with the next BioC release. In the meantime, if you need corrected metadata, you can generate it using the ENCODExplorerData package.

vjcitn commented 4 years ago

Thank you!