ebi-gene-expression-group / atlas-web-bulk

Bulk Expression Atlas web application
Apache License 2.0

Correct the mapping on external resources links in Supplementary Information page #195

Closed lingyun1010 closed 1 month ago

lingyun1010 commented 4 months ago

We have some conflicting mappings on the bulk Supplementary Information page, involving experiment type and ArrayExpress.

In the case of E-PROT-39, its experiment type is RNASEQ_MRNA_DIFFERENTIAL, so the external resources are grouped under ENA; it also contains an ArrayExpress link, which is invalid as well.

https://www.ebi.ac.uk/gxa/experiments/E-PROT-39/Supplementary%20Information

sfexova commented 4 months ago

As agreed on Slack, the accession-to-link resolution should be made independent of experiment type and rely only on the accession style itself. Here are the accession-to-resource mappings:

ArrayExpress accessions
E-MTAB<> -> ArrayExpress
E-ERAD<> -> ArrayExpress
E-GEUV<> -> ArrayExpress

ProteomeXchange accessions - can be viewed in PRIDE (and elsewhere)
PXD<> -> PRIDE

GEO accessions
GSE<> -> GEO
GDS<> -> GEO

INSDC consortium project accessions - can be viewed in ENA (and elsewhere)
ERP<> -> ENA
SRP<> -> ENA
DRP<> -> ENA

BioProject INSDC consortium accessions - can be viewed in ENA (and elsewhere)
PRJEB<> -> ENA
PRJNA<> -> ENA
PRJDB<> -> ENA

EGA accessions
EGAS<> -> EGA
EGAD<> -> EGA
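
For illustration, the prefix rules above could be sketched as a simple lookup in Python. This is only a minimal sketch, not the Atlas implementation: `PREFIX_TO_RESOURCE` and `resource_for` are hypothetical names, and the function returns a resource name only, since the link URL patterns are not specified here.

```python
# Accession prefix -> resource name, following the mapping list in this comment.
# ProteomeXchange dataset identifiers use the "PXD" prefix.
PREFIX_TO_RESOURCE = {
    "E-MTAB": "ArrayExpress",
    "E-ERAD": "ArrayExpress",
    "E-GEUV": "ArrayExpress",
    "PXD": "PRIDE",
    "GSE": "GEO",
    "GDS": "GEO",
    "ERP": "ENA",
    "SRP": "ENA",
    "DRP": "ENA",
    "PRJEB": "ENA",
    "PRJNA": "ENA",
    "PRJDB": "ENA",
    "EGAS": "EGA",
    "EGAD": "EGA",
}


def resource_for(accession):
    """Return the resource name for an accession, or None if no prefix matches.

    The decision depends only on the accession style, never on experiment type.
    """
    for prefix, resource in PREFIX_TO_RESOURCE.items():
        if accession.startswith(prefix):
            return resource
    return None
```

None of the listed prefixes is a prefix of another, so the iteration order does not affect the result.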

Some E-HCAD experiments (these would be in SCEA only, not bulk) may have a 'bundle ID' in the secondary accession field in the idf, but I am not sure whether that could be used to search for and point to a project in the HCA Data Portal.

sfexova commented 2 months ago

I've added the EGA accession mapping to the list above. Following discussions on Slack and during the sprint meeting, I suggest dropping the existing display hierarchy, as it could accidentally remove valid multiple entries (e.g. for some CURD datasets where more than one experiment has been combined into one), and instead displaying all sources by default. The logic to check for truly synonymous entries may be quite complicated and, I believe, not worth the effort right now. If we discover cases where displaying all sources creates problems for users, we can reevaluate.
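
As a rough sketch of "display all sources by default", the accessions could simply be grouped by resource, with nothing dropped or ranked. This is an assumption-laden Python illustration: `PREFIXES` is an abbreviated subset of the mapping, `group_all` is a hypothetical name, and the accessions in the usage note are made up.

```python
from collections import defaultdict

# Abbreviated, illustrative subset of the prefix -> resource mappings.
PREFIXES = {"GSE": "GEO", "ERP": "ENA", "SRP": "ENA", "EGAS": "EGA"}


def group_all(secondary_accessions):
    """Group every recognised accession under its resource.

    No hierarchy is applied: several accessions for the same resource,
    or for different resources, are all kept for display.
    """
    grouped = defaultdict(list)
    for accession in secondary_accessions:
        for prefix, resource in PREFIXES.items():
            if accession.startswith(prefix):
                grouped[resource].append(accession)
                break
    return dict(grouped)
```

With made-up input such as `["ERP000001", "SRP000002", "GSE00003"]`, both ENA accessions are kept side by side under ENA and the GEO accession is listed under GEO.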

lingyun1010 commented 2 months ago

Hi @sfexova, I have implemented the EGA, ENA and GEO resource links, but ArrayExpress is a bit different. For example, for experiment E-MTAB-1913, the idf file has only one secondaryAccession, ERP003983, which points to ENA; there are no secondary accessions pointing to ArrayExpress, only the experiment accession itself.

So does that mean that ArrayExpress links should be resolved from the experiment accession, the secondary accession, or both?

sfexova commented 2 months ago

ah, good point!! yes, for experiments from ArrayExpress it needs to be a bit different - for experiments with the ArrayExpress accession E-MTAB-XX we should look at the experiment accession only and ignore the [secondary accession] pointing to ENA because there we know they are synonymous

lingyun1010 commented 2 months ago

Okay, thanks for the clarification, and how about the others?

E-ERAD<> -> ArrayExpress
E-GEUV<> -> ArrayExpress

Are these the experiment accession or the [secondary accession]? Thanks.

sfexova commented 2 months ago

yes, the same rules apply to E-ERAD and E-GEUV as to the E-MTAB AE accessions: for these, ignore the [secondary accession] and use the experiment accession to link to ArrayExpress. The mapping rules above were all meant for the [secondary accession], i.e. for cases when these different accession codes appear in the [secondary accession] field in the idf.
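
Putting the two rules together, the resolution could look roughly like the Python sketch below. It is not the actual implementation: `external_links`, `AE_PREFIXES` and `SECONDARY_PREFIXES` are hypothetical names, and it assumes that for ArrayExpress experiments all secondary accessions are ignored, not only the ENA ones.

```python
# Experiment accession prefixes that are linked to ArrayExpress directly.
AE_PREFIXES = ("E-MTAB", "E-ERAD", "E-GEUV")

# Prefix -> resource rules applied to the secondary accession field.
SECONDARY_PREFIXES = {
    "E-MTAB": "ArrayExpress", "E-ERAD": "ArrayExpress", "E-GEUV": "ArrayExpress",
    "PXD": "PRIDE",
    "GSE": "GEO", "GDS": "GEO",
    "ERP": "ENA", "SRP": "ENA", "DRP": "ENA",
    "PRJEB": "ENA", "PRJNA": "ENA", "PRJDB": "ENA",
    "EGAS": "EGA", "EGAD": "EGA",
}


def external_links(experiment_accession, secondary_accessions):
    """Return (resource, accession) pairs to display for an experiment."""
    # ArrayExpress experiments: use the experiment accession only, because
    # the secondary accession is known to be synonymous.
    # Assumption: every secondary accession is ignored in this case.
    if experiment_accession.startswith(AE_PREFIXES):
        return [("ArrayExpress", experiment_accession)]

    # All other experiments: resolve each secondary accession by prefix
    # and keep every match.
    links = []
    for accession in secondary_accessions:
        for prefix, resource in SECONDARY_PREFIXES.items():
            if accession.startswith(prefix):
                links.append((resource, accession))
                break
    return links
```

For E-MTAB-1913 with secondary accession ERP003983, this yields a single ArrayExpress link for E-MTAB-1913 and no ENA link.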