bio-tools / biotoolsRegistry

biotoolsregistry : discovery portal for bioinformatics
GNU General Public License v3.0

Systematic check that all "collection" tags satisfy the new regex #285

Closed joncison closed 6 years ago

joncison commented 6 years ago

Same deal as in https://github.com/bio-tools/biotoolsregistry/issues/284, there's a change in the regex to support (the eventual) use of collection IDs in semantic web applications.

Old pattern: [A-Za-z0-9-~.]+

New pattern: [a-zA-Z][-.0-9a-zA-Z]*

Again, collection tags must now start with an underscore or a letter.

@hansioan : will you pls. check ASAP that our existing collection tags satisfy the new pattern - and confirm here?

cc @matuskalas

hansioan commented 6 years ago

They don't all match the new pattern because many collections contain a space. The ones that are currently in bio.tools and don't match: http://cbs.dtu.dk/services g:Profiler toolkit g:Profiler Rostlab tools Czech Republic Masaryk University SHOW - Structured HOmogeneities Watcher MOdels for Data Analysis and Learning - MODAL Bologna Biocomputing Group MoD Tools EBI Tools ChEBI Tools ChEMBL Tools UniProt Tools Ensembl Tools EMBOSS at EBI Tools PDBe Tools EBI Tools (ENA Tools) Thornton Tools Europe PMC Tools Parkinson Tools Goldman Tools Plant Systems Biology BIG N2N Tel Aviv University BioMedBridges Tools Instruct CCP4 Cell Line Integrated Molecular Authentication database and Identification tool USMI Cell Line Database and Analisys Tools USMI Biological Resources Catalogues Bromberglab tools RostLab tools, PredictProtein Odonoghuelab tools http://galaxyapi.web.pasteur.fr EMBOSS_6.3.1 hmmer_3.0 mview1.49 CBS phylip_3.67 pdb-lib_1.0 blastTaxoAnalysis_1.0 njplot_20051109 ViennaRNA_1.8.4 newick-utils_1.6 Clustal-Omega_1.1.0 blast_2.2.26 ClustalW_2.0.12 taxoptimizer_1.1 squizz_0.99b UiO tools KMUTT tools CBU tools UiB tools BiB tools Debian Med NTNU tools Regulatory Sequence Analysis Tools (RSAT) Segway Suite Institut Pasteur Bioinformatics and Biostatistics Hub GEM Pasteur Bioinformatics and Biostatistics Hub Pasteur Structural Mass Spectrometry and Proteomics EBI Training Tools GO Tools ELIXIR Trainer Tools Rare Disease http://www.pubmedcentral.gov/ Medizinisches Proteom-Center LCC NCBR Animal and Crop Genomics micro-computed tomography

joncison commented 6 years ago

Hmm, so we'd need two elements, similar to how we handle name/ID currently

We can revert / refactor biotoolsSchema accordingly - we don't want to be introducing "_" into the collection tags, when rendered

matuskalas commented 6 years ago

N.B.

The old pattern for collection IDs was [A-Za-z0-9_\- _ ~]+

The new pattern is [_a-zA-Z][_\-.0-9a-zA-Z]*

So the following should match the new pattern, although they are highly disputable as collections. Except for CBS, they look like either tools or toolkits (see also my mention of toolkits further down): EMBOSS_6.3.1 hmmer_3.0 mview1.49 CBS phylip_3.67 pdb-lib_1.0 blastTaxoAnalysis_1.0 njplot_20051109 ViennaRNA_1.8.4 newick-utils_1.6 Clustal-Omega_1.1.0 blast_2.2.26 ClustalW_2.0.12 taxoptimizer_1.1 squizz_0.99b

@hansioan, if you wouldn't mind, could you please re-generate your list once more with the actual new pattern?
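For reference, a minimal sketch of such a check in Python, using the new pattern exactly as quoted above. The function name `non_matching` and the sample list are made up for illustration; `fullmatch` is used on the assumption that the pattern is applied to the whole tag, as a schema pattern facet would be.

```python
import re

# New collectionID pattern, copied from the comment above.
NEW_PATTERN = re.compile(r"[_a-zA-Z][_\-.0-9a-zA-Z]*")


def non_matching(tags):
    """Return the tags that do not match the new pattern in full."""
    return [tag for tag in tags if NEW_PATTERN.fullmatch(tag) is None]


# A few tags taken from the list earlier in this thread.
sample = [
    "EMBOSS_6.3.1",
    "hmmer_3.0",
    "CBS",
    "EBI Tools",
    "g:Profiler",
    "http://cbs.dtu.dk/services",
]

print(non_matching(sample))
```

With this pattern the versioned names and CBS pass, while anything containing a space, colon, or slash is reported.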

URLs as collectionIDs

These are total nonsense as either collectionID or collectionName, and need to be replaced with what they actually point to or stand for. Luckily only 3 of those exist.

http://www.pubmedcentral.gov/

http://galaxyapi.web.pasteur.fr

http://cbs.dtu.dk/services
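If more of these turn up later, URL-shaped tags could be flagged automatically. A rough sketch, assuming only http(s) URLs matter; `looks_like_url` is a hypothetical helper, not something that exists in bio.tools:

```python
from urllib.parse import urlparse


def looks_like_url(tag):
    """True if the tag parses as an http(s) URL with a host part."""
    parts = urlparse(tag)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


tags = [
    "http://www.pubmedcentral.gov/",
    "http://galaxyapi.web.pasteur.fr",
    "http://cbs.dtu.dk/services",
    "g:Profiler",
    "EBI Tools",
]

url_tags = [tag for tag in tags if looks_like_url(tag)]
print(url_tags)
```

Checking the scheme explicitly keeps tags such as g:Profiler from being mistaken for URLs just because they contain a colon.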

Space

Some options for handling the space (as I mentioned also in https://github.com/bio-tools/biotoolsSchema/issues/94):

  1. Replacing spaces with underscores and keeping going. (Simplest)
  2. Adding a collectionName attribute, as mentioned right above by @joncison. (N.B. that here it isn't so trivial to keep the collectionID-collectionName pairs consistent without going in the direction of the following (option 3.))
  3. Maintaining bio.tools records for collections, with ID and name in the first place (option 2.), and possibly adding other useful attributes, such as contacts and homepage (see collection URLs above), or even more (documentation, download pages, publications - these make sense especially for toolkits, software suites, workbenches (these 3 may be bio.tools records proper rather than collections), but also providers of multiple Web services (CBS, EBI), etc. - CC @hmenager what do you think?)
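To make options 1 and 2 concrete, here is a rough Python sketch. The `Collection` class, its field names, and both helper functions are assumptions for illustration only and do not reflect biotoolsSchema or the registry code. The ID is derived from the name by replacing whitespace with underscores (option 1), and the pair is kept together so the original name can still be rendered (option 2).

```python
import re
from dataclasses import dataclass

# New collectionID pattern, as quoted earlier in this comment.
NEW_PATTERN = re.compile(r"[_a-zA-Z][_\-.0-9a-zA-Z]*")


@dataclass(frozen=True)
class Collection:
    collection_id: str    # hypothetical field: machine-friendly ID
    collection_name: str  # hypothetical field: human-readable name


def to_collection_id(name):
    """Option 1: replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", name.strip())


def make_collection(name):
    """Option 2: keep ID and name together, deriving the ID from the name."""
    collection_id = to_collection_id(name)
    if NEW_PATTERN.fullmatch(collection_id) is None:
        raise ValueError(f"cannot derive a valid collectionID from {name!r}")
    return Collection(collection_id, name)


print(make_collection("EBI Tools"))
```

Deriving the ID from the name is one way to keep the pairs consistent, but it only covers spaces: names containing a colon, comma, or parentheses (g:Profiler, EBI Tools (ENA Tools)) still fail the pattern and would need manual IDs.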

joncison commented 6 years ago

We need to tackle developments in stages. The end-game is definitely option 3 above, to manifest as "Contributor Cards" and/or "Collection Cards"; this is on the roadmap for next year (https://biotools.sifterapp.com/issues/432). Until then, the simplest of all options is to revert to accepting any tag for collectionID, with the above duly noted.

On a related note, it would be very nice to support in bio.tools the relation element, which would then immediately allow us to relate, e.g.

I'll discuss this with @ekry and @hansioan tomorrow.

hansioan commented 6 years ago

@matuskalas @joncison The regenerated list of collectionIDs that don't match the new pattern is: http://cbs.dtu.dk/services g:Profiler toolkit g:Profiler Rostlab tools Czech Republic Masaryk University SHOW - Structured HOmogeneities Watcher MOdels for Data Analysis and Learning - MODAL Bologna Biocomputing Group MoD Tools EBI Tools ChEBI Tools ChEMBL Tools UniProt Tools Ensembl Tools EMBOSS at EBI Tools PDBe Tools EBI Tools (ENA Tools) Thornton Tools Europe PMC Tools Parkinson Tools Goldman Tools Plant Systems Biology BIG N2N Tel Aviv University BioMedBridges Tools Instruct CCP4 Cell Line Integrated Molecular Authentication database and Identification tool USMI Cell Line Database and Analisys Tools USMI Biological Resources Catalogues Bromberglab tools RostLab tools, PredictProtein Odonoghuelab tools http://galaxyapi.web.pasteur.fr UiO tools KMUTT tools CBU tools UiB tools BiB tools Debian Med NTNU tools Regulatory Sequence Analysis Tools (RSAT) Segway Suite Institut Pasteur Bioinformatics and Biostatistics Hub GEM Pasteur Bioinformatics and Biostatistics Hub Pasteur Structural Mass Spectrometry and Proteomics EBI Training Tools GO Tools ELIXIR Trainer Tools Rare Disease http://www.pubmedcentral.gov/ Medizinisches Proteom-Center LCC NCBR Animal and Crop Genomics micro-computed tomography

joncison commented 6 years ago

For now, we revert to simple tags (see https://github.com/bio-tools/biotoolsSchema/issues/79)