codingdavinci / relaunch2018

This is the new Coding da Vinci website (online since September 2020).
https://codingdavinci.de
GNU General Public License v2.0

Port archives of current page #68

Closed. Snater closed this issue 5 years ago.

Snater commented 5 years ago

I didn't spot this in the brief, but I am sure we talked about it in the kick-off. (Not sure which milestone it belongs to, though.) It should be feasible to write a parser for the archive content of the current page (projects, data sets) and feed its output into a custom import script that programmatically creates the necessary content type instances. At least most of the relevant information should be portable. In that process, we could also map the old archive URLs to the new page content, although I am not sure yet what the best way to handle that is. On the current page, projects have their own individual HTML pages; these URLs could be added as Drupal path aliases. Data sets, however, are just anchor links and cannot be turned into Drupal path aliases, so they would probably require some fairly sophisticated mechanism. Porting events should be done manually, because the content structure is completely different in the new system.
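Roughly, I picture the parser/alias part working like this; just a sketch in Python, assuming BeautifulSoup is used for the one-off parsing step, and with the selector and function names being placeholders rather than the actual script:

from urllib.parse import urlparse

from bs4 import BeautifulSoup


def build_project_records(archive_html: str) -> list:
    """Sketch only: pull each project's title and old URL out of the archived
    project index; the old path is kept so it can become a Drupal path alias."""
    soup = BeautifulSoup(archive_html, "html.parser")
    records = []
    for link in soup.select("a.project"):  # hypothetical selector
        old_url = link.get("href", "")
        records.append({
            "title": link.get_text(strip=True),
            "old_url": old_url,
            # Reusing the old path as the alias keeps existing external links working.
            "path_alias": urlparse(old_url).path,
        })
    return records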

lucyWMDE commented 5 years ago

Mapping the archive project URLs is really important. Let me know if I can help. As for the data sets: if we can rebuild the internal links from projects to the data sets they were based on, that would be great (I'd really rather not have to go through all of that manually...). Mapping the data set links in general is otherwise not so important (I don't imagine many external users ever figured out the anchor links). However, I guess the tricky part is already there if we do the former... let me know how tricky it is. If it really is a stupid task, then we may as well do the stupid task of re-linking the data sets from the project submission forms manually.


Snater commented 5 years ago

I have started with the data sets by extracting information from the HTML file. Unfortunately, there are a few inconsistencies, most prominently:

The data I have extracted:

I guess that's all the information that can be extracted. I haven't gone forward with the actual import yet, as I am not sure how to map all that information:

lucyWMDE commented 5 years ago

So. I think the category tag was an idea that never really got realised. There's nothing in the front-end that uses that information... and the quality of that information definitely degenerated, though I'm not sure if it was ever really good - most things seem to be labelled "Flora & Fauna". No need to import that data.

Snater commented 5 years ago

OK, that's all fine.

That extra field looks like the most practical solution. (Technically, I would probably not add another content type, but just add a field to the data set content type and show it only to admins. That field could be removed once the manual processing is done.) An alternative I thought of was dumping all the info in a structured format (e.g. CSV) into a file, or simply printing it for copying into a file, so the info could then be cut and pasted out of that file. Transferring the info might be more convenient that way, but you would always have to find the entry corresponding to the data set being reviewed, whereas with a field on the content type you have that information at hand instantly.

Adding/copying the links manually is not a nice job. I could just add all of a data set's links to all three link fields. Removing a link from the wrong field is just one click, which would probably be a lot more practical than copying and pasting the URL and the link title for each link.

I will have a look at the licenses. My idea is to check whether "meta" appears in a license line; if so, I'd use that line's license for the metadata. If there is only one license, I'd apply it to both. Either way, there are edge cases which make manual processing mandatory; a rough sketch of the heuristic follows the examples below.

<dd class="data-license">
  Abhängig vom Objekt: <br/>
  Medienobjekte und Orte <span class="label label-danger">CC0</span><br/>
  Geschichten <span class="label label-danger">CC-BY-SA</span><br/>
</dd>
<dd class="data-license">
   <!-- span class="label label-danger">[Lizenz]</span -->
   siehe <a  target=_blank href="https://www.kulturarv.dk/fbb/omfbb.htm">Hinweise zur Nutzung</a>
</dd>
<dd class="data-license">
  Metadaten: <span class="label label-danger"></span></br>
  Werke: i.d.R. liegen die Rechte bei den Künstlern
</dd>
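For what it's worth, in Python the heuristic would look roughly like this (a sketch only, not the actual import code; the tag handling is simplified):

import re


def license_value(line: str) -> str:
    """Prefer the text inside the <span class="label ..."> if there is one."""
    match = re.search(r"<span[^>]*>(.*?)</span>", line)
    value = match.group(1) if match else re.sub(r"<[^>]+>", "", line)
    return value.strip()


def split_licenses(license_lines: list) -> dict:
    """A line mentioning "meta" holds the metadata license, everything else
    counts as the media license; a single license is applied to both.
    The edge cases quoted above still need manual review."""
    meta = media = None
    for line in license_lines:
        value = license_value(line)
        if not value:
            continue
        if "meta" in line.lower():
            meta = value
        else:
            media = value
    return {"meta": meta or media, "media": media or meta}

On straightforward input like "Bilder: <span>CC-BY-SA</span>" / "Metadaten: <span>CC0</span>" this yields the media and metadata licenses directly; the mixed cases above would still come out wrong and need manual cleanup.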

I am going to implement a "legacy" field and dump all data set info into it in JSON format, apart from filling all regular fields as well as possible on a general level. :)

Snater commented 5 years ago

Here's an export example in JSON. The value of the key __raw_meta would end up unescaped in that legacy field. (I hope that all the general information – institution name, data set title, image and description – imports properly.) In the raw data I left the plain <span> tags in place, because they act as logical separators in some cases. In this example, the script detected the metadata and media licenses. However, inserting licenses will fail in quite a few cases. Sometimes no CC license version is named, like in this case; that cannot be imported with our current strict setup. Also, the version number is sometimes inside the <span> and sometimes outside it, with additional text on the same line that is a bit complex to parse. Apart from that, I think the result is quite OK. I also noticed another special case in this particular example: a link to a local file.

[{
  "name": "Botanischer Garten und Botanisches Museum Berlin",
  "dataSets": [{
    "__raw_meta": {
      "Lizenz": [
        "Bilder: <span>CC-BY-SA<\/span>",
        "Metadaten: <span>CC0<\/span>"
      ],
      "Dateityp": [
        "<span>JPEG<\/span><span>CSV<\/span><span>XML<\/span>"
      ],
      "Jahr": [
        "<span>Berlin 2017<\/span>"
      ],
      "Links": [
        "<a href=\"http:\/\/ww3.bgbm.org\/biocase\/downloads\/Schweinfurth\/Collection%20of%20botanical%20drawings%20by%20Georg%20Schweinfurth%20at%20B.DwCA.zip\">Metadaten<\/a>",
        "<a href=\"\/downloads\/daten-2017\/1245_P_BGBM_Schwenfurth_Zeichnungen.pdf\">Datenpr\u00e4sentation<\/a>"
      ]
    },
    "image": {
      "src": "{{ site.baseurl }}img\/daten\/bgbm-georg.jpg",
      "alt": "Georg Schweinfurths Sammlung botanischer Zeichnungen"
    },
    "title": "Georg Schweinfurths Sammlung botanischer Zeichnungen",
    "description": [
      "<p>Georg Schweinfurth, 1836\u20131925, war in erster Linie Botaniker und Pflanzengeograf, forschte und publizierte aber auch ausgiebig in Geografie, Ethnologie, Anthropologie und \u00c4gyptologie. Dabei hielt er seine Beobachtungen nicht nur in anschaulichen Texten fest, sondern vermochte als hervorragender Zeichner die Forschungsobjekte seiner ungew\u00f6hnlich breiten wissenschaftlichen Bet\u00e4tigung ebenso ansprechend wie instruktiv abzubilden. Schon zu Lebzeiten \u00fcberlie\u00df Schweinfurth dem Botanischen Garten und Botanischen Museum Berlin-Dahlem seine in vier Foliob\u00e4nden gebundene Sammlung botanischer Zeichnungen, die w\u00e4hrend des 2. Weltkrieges zusammen mit kostbaren alten Abbildungswerken der Bibliothek und besonders wertvollen Teilen des Herbars in Stollen der Kali-Werke Bleicherode-Ost bei Nordhausen ausgelagert wurden und so die Vernichtung von Herbar und Bibliothek durch Bombentreffer im M\u00e4rz 1943 \u00fcberstanden. In den vergangenen Jahren wurden die 624 erhaltenen Bl\u00e4tter, viele davon in fragilem Zustand, inventarisiert, digitalisiert, wissenschaftlich bearbeitet und im Web zug\u00e4nglich gemacht.<\/p>"
    ],
    "Lizenz": {
      "media": "CC-BY-SA",
      "meta": "CC0"
    },
    "Dateityp": [
      "JPEG",
      "CSV",
      "XML"
    ],
    "Jahr": [
      "Berlin 2017"
    ],
    "Links": [
      {
        "url": "http:\/\/ww3.bgbm.org\/biocase\/downloads\/Schweinfurth\/Collection%20of%20botanical%20drawings%20by%20Georg%20Schweinfurth%20at%20B.DwCA.zip",
        "title": "Metadaten"
      },
      {
        "url": "\/downloads\/daten-2017\/1245_P_BGBM_Schwenfurth_Zeichnungen.pdf",
        "title": "Datenpr\u00e4sentation"
      }
    ]
  }, <...>]
}, <...>]

Snater commented 5 years ago

I have created the import script for the projects. It seems like all information can be imported, except the references to the data sets used by each project. URL aliases are created as well.

As for the data sets, there is now a legacy field that receives all the metadata specified on the current page. This field is not accessible/visible to institution users.

Technically, I am done with this issue. I'd just need to run the import at a point in time still to be decided. The import should only be run once, because there is no tracking of imported entities: if the import fails or some information is missing, all imported entities would need to be deleted before rerunning it. Deleting is a quick task, but once everything is imported and the imported data sets have been edited, rerunning the import would mean all those changes are lost. So it is important to check right after importing that the results are not broken in any way.
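The check itself could be as simple as comparing the parsed source records against what can be read back out of Drupal, along these lines (purely illustrative; how the imported data is read back is not part of the sketch):

def find_import_gaps(source_records: list, imported_records: list) -> list:
    """Illustrative only: report data sets that went missing during the import
    or that lost fields which were present in the source."""
    imported_by_title = {record.get("title"): record for record in imported_records}
    problems = []
    for source in source_records:
        title = source.get("title")
        imported = imported_by_title.get(title)
        if imported is None:
            problems.append("missing: %s" % title)
            continue
        for field in ("description", "Links", "Lizenz"):
            if source.get(field) and not imported.get(field):
                problems.append("%s: field %s did not survive the import" % (title, field))
    return problems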

Running the import only once means that, as long as the current site is still online, data sets and projects would need to be maintained in both sites in parallel after the import has run. So the time between importing and putting the new site online should be kept as short as possible.

Snater commented 5 years ago

The general import is technically done, so I guess we can close this issue. If there are still problems resulting from the import or if there should be tickets for manually processing the imported data, we should open new issues.