geneontology / archive-reconstruction

Code to move various legacy files to the current release.geneontology.org
BSD 3-Clause "New" or "Revised" License

Reconstruction from SVN (+ decide on a proper remapping of files) #1

Closed (lpalbou closed 3 years ago)

lpalbou commented 4 years ago

The remapping of the SVN repo is handled by a simple mapping file. It allows:

@kltm @pgaudet I have created a default mapping.txt which created this archive: https://geneontology-tmp.s3.amazonaws.com/index.html#releases/

By further editing the mapping file, we can remap to specific filenames that would be more consistent with current.geneontology.org (mostly the GAFs).
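To illustrate the idea of a mapping file driving the remap, here is a minimal sketch. The actual format of mapping.txt is not shown in this thread, so the tab-separated "legacy prefix, release prefix" layout, the example folder names, and the helper names `parse_mapping` and `remap` are all assumptions made for illustration only.

```python
from pathlib import PurePosixPath


def parse_mapping(text):
    """Parse lines of the ASSUMED form '<legacy SVN prefix><TAB><release prefix>'."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split("\t", 1)
        rules.append((source.strip("/"), target.strip("/")))
    # Try the longest prefix first so the most specific rule wins.
    rules.sort(key=lambda rule: len(rule[0]), reverse=True)
    return rules


def remap(svn_path, rules):
    """Return the remapped path, or None when no rule covers svn_path."""
    path = svn_path.strip("/")
    for source, target in rules:
        if path == source or path.startswith(source + "/"):
            rest = path[len(source):].lstrip("/")
            return str(PurePosixPath(target, rest)) if rest else target
    return None


# Hypothetical mapping content; the folder names are illustrative only.
EXAMPLE_MAPPING = (
    "# legacy SVN path\trelease path\n"
    "gene-associations\tannotations\n"
    "ontology\tontology\n"
)

rules = parse_mapping(EXAMPLE_MAPPING)
print(remap("gene-associations/gene_association.aspgd.gz", rules))
```

Returning None for unmapped paths, rather than copying them through unchanged, makes it easy to list which legacy files still need a decision.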

kltm commented 4 years ago

@lpalbou Lovely.

As mentioned in a thread, there is a bit of "mapping" that we (unfortunately) do for filenames in the pipeline:

https://github.com/geneontology/pipeline/blob/dac947b73cfc9cfd39c832ac0699def334522bde/Jenkinsfile#L1332-L1370

This was added as a late sop to legacy SVN users (now almost all gone) as we migrated away from SVN. This should be disappearing in fairly short order after the legacy work is done. It likely makes sense to have the canonical/historical version of this exist in your code/metadata, rather than awkwardly embedded in the pipeline file.

lpalbou commented 4 years ago

I updated to a V2 that should take care of the file mapping: https://github.com/geneontology/archive-reconstruction/commit/fb5df41468c5469238c885b69ac4b6f6f56df879 .

The example releases generated for this V2 are here: https://geneontology-tmp.s3.amazonaws.com/index.html#releases-2/ (should finish upload in a few hours). They use current annotation filenames.

Notes:

pgaudet commented 4 years ago

Are you saying you will make the names of the files match what we currently have ? For example

Or are we trying to keep the legacy names ?

On that topic, are the names of the various folders fixed ? For example, ideally, I don't think we should have two 'annotation' folders, regardless of the fact that they are in different parent folders.

pgaudet commented 4 years ago

Hi @lpalbou Another question, I see that your nice interface has the structure '/releases/2016-08-01/[annotations/ontology/products]' are we keeping this, regardless of the fact that there were no 'releases' before the current pipeline ?

pgaudet commented 4 years ago

@lpalbou Looks like in your browser, only the source files are present in the /products folder.

  1. The files should be renamed consistently with the current naming scheme (e.g. aspgd-src.gaf.gz)
  2. Can we change the folder names ?? and ideally the contents (I think that messes up the zenodo archive but we could add a README to explain ??)
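As a rough sketch of the renaming in point 1: the snippet below assumes the legacy files follow a `gene_association.<group>.gz` pattern and that the current source files are named `<group>-src.gaf.gz`, as in the aspgd example. The legacy pattern and the helper name `current_source_name` are assumptions for illustration, not the rules actually used by the archive code.

```python
import re

# ASSUMED legacy pattern: gene_association.<group>.gz
LEGACY_GAF = re.compile(r"^gene_association\.(?P<group>[A-Za-z0-9_]+)\.gz$")


def current_source_name(legacy_name):
    """Map an assumed legacy GAF filename to the '<group>-src.gaf.gz' style.

    Returns None when the name does not look like a legacy GAF.
    """
    match = LEGACY_GAF.match(legacy_name)
    if match is None:
        return None
    return f"{match.group('group')}-src.gaf.gz"


print(current_source_name("gene_association.aspgd.gz"))
```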

@kltm I suspect you're not going to like those suggestions, but it feels like this would be a good opportunity to clarify all our files.

@cmungall Thanks, Pascale

lpalbou commented 4 years ago

@pgaudet we are not keeping the legacy names; we are trying to remap to the current names.

The S3 link you provided was the first attempt (I am going to update the ticket with the new URL, and the main README as well). Please check this one instead: https://geneontology-tmp.s3.amazonaws.com/index.html#releases-full/2016-08-01/annotations/

> On that topic, are the names of the various folders fixed ?

Nothing is fixed. I am just remapping to what we currently have for consistency, but as you know I am not thrilled with the current folder hierarchy either.

lpalbou commented 4 years ago

> Another question, I see that your nice interface has the structure '/releases/2016-08-01/[annotations/ontology/products]' are we keeping this, regardless of the fact that there were no 'releases' before the current pipeline ?

I would say yes, so that users can refer to that specific version of GO. But if you prefer, we could also have '/archive/2016-08-01/[annotations/ontology/products]'. That is probably more correct, but it would slightly complicate reuse by bioinformaticians.
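To make the two layouts concrete, here is a small sketch of how a script could build either prefix. The function name `release_prefix` and its `root` parameter are hypothetical; only the '/releases/...' and '/archive/...' patterns and the three section names come from this thread.

```python
from datetime import date

# Section names taken from the structure quoted in this thread.
SECTIONS = ("annotations", "ontology", "products")


def release_prefix(release_date, section, root="releases"):
    """Build '/<root>/<YYYY-MM-DD>/<section>/' for a dated snapshot."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section!r}")
    return f"/{root}/{release_date.isoformat()}/{section}/"


print(release_prefix(date(2016, 8, 1), "annotations"))
print(release_prefix(date(2016, 8, 1), "annotations", root="archive"))
```

With a single `root` parameter, a client written against '/releases/' would only need one argument changed to read from '/archive/', which is the extra step the second option would ask of users.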

lpalbou commented 4 years ago

> Can we change the folder names ?? and ideally the contents

For the archive, it's easy, we just have to edit the mapping file: https://github.com/geneontology/archive-reconstruction/blob/master/mapping.txt

To clarify, are you proposing that remapping only for the archive, or for both the archive and our current releases ? I am guessing the latter. I don't like the current folder hierarchy either, so I think it would be great to make it more intuitive; at the same time, we have about 2 years of Zenodo archives in that format, so we would have to discuss if and how we want to deal with that.