codingdavinci / relaunch2018

This is the new Coding da Vinci website (online since September 2020).
https://codingdavinci.de
GNU General Public License v2.0

Port archives of current page #68

Closed. Snater closed this issue 5 years ago.

Snater commented 5 years ago

I didn't spot this in the brief, but I am sure we talked about it in the kick-off. (Not sure which milestone it belongs to, though.) It should be feasible to write a parser for the archive content of the current page (projects, data sets) and feed its output into a custom import script that programmatically creates the necessary content type instances. At least most of the relevant information should be portable. In that process, we could also map the old archive URLs to the new page content, although I am not sure yet what the best way to handle that is. On the current page, projects have their own individual HTML pages; these URLs could be added as Drupal path aliases. Data sets, however, are just anchor links and cannot be turned into Drupal path aliases, so they would probably require some fairly sophisticated mechanism. Porting events should be done manually, because the content structure is completely different in the new system.
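Roughly, I picture the parser/alias part working like this; just a sketch in Python, assuming BeautifulSoup is used for the one-off parsing step, and with the selector and function names being placeholders rather than the actual script:

from urllib.parse import urlparse

from bs4 import BeautifulSoup


def build_project_records(archive_html: str) -> list:
    """Sketch only: pull each project's title and old URL out of the archived
    project index; the old path is kept so it can become a Drupal path alias."""
    soup = BeautifulSoup(archive_html, "html.parser")
    records = []
    for link in soup.select("a.project"):  # hypothetical selector
        old_url = link.get("href", "")
        records.append({
            "title": link.get_text(strip=True),
            "old_url": old_url,
            # Reusing the old path as the alias keeps existing external links working.
            "path_alias": urlparse(old_url).path,
        })
    return records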

lucyWMDE commented 5 years ago

Mapping the archive project URLs is really important. Let me know if I can help. As for the data sets: if we can rebuild the internal links from projects to the data sets they were based on, that would be great (I'd really rather not have to go through all of that manually...). Mapping the data set links in general is otherwise not so important (I don't imagine many external users ever figured out the anchor links). However, I guess the tricky part is already there if we do the former... let me know how tricky it is. If it really is a stupid task, then we may as well do the stupid task of re-linking the data sets from the project submission forms manually.


Snater commented 5 years ago

I have started with the data sets by extracting information from the HTML file. Unfortunately, there are a few inconsistencies, most prominently:

The data I have extracted:

I guess that's all the information that can be extracted. I haven't gone forward with the actual import yet, as I am not sure how to map all that information:

lucyWMDE commented 5 years ago

So. I think the category tag was an idea that never really got realised. There's nothing in the front-end that uses that information... and the quality of that information definitely degenerated, though I'm not sure if it was ever really good - most things seem to be labelled "Flora & Fauna". No need to import that data.

Snater commented 5 years ago

OK, that's all fine.

That extra field looks like the most practical solution. (Technically, I would probably not add another content type, but just add a field to the data set content type and show it only to admins. That field could be removed once the manual processing is done.) An alternative I thought of was dumping all the info in a structured format (e.g. CSV) into a file, or simply printing it for copying into a file, so the info could then be cut and pasted out of that file. Transferring the info might be more convenient that way, but you would always have to find the entry corresponding to the data set being reviewed, whereas with a field on the content type you have that information at hand instantly.

Adding/copying the links manually is not a nice job. I could just add all of a data set's links to all three link fields. Removing a link from the wrong field is just one click, which would probably be a lot more practical than copying and pasting the URL and the link title for each link.

I will have a look at the licenses. My idea is to check whether "meta" appears in a license line; if so, I'd use that line's license for the metadata. If there is only one license, I'd apply it to both. Either way, there are edge cases which make manual processing mandatory; a rough sketch of the heuristic follows the examples below.

<dd class="data-license">
  Abhängig vom Objekt: <br/>
  Medienobjekte und Orte <span class="label label-danger">CC0</span><br/>
  Geschichten <span class="label label-danger">CC-BY-SA</span><br/>
</dd>
<dd class="data-license">
   <!-- span class="label label-danger">[Lizenz]</span -->
   siehe <a  target=_blank href="https://www.kulturarv.dk/fbb/omfbb.htm">Hinweise zur Nutzung</a>
</dd>
<dd class="data-license">
  Metadaten: <span class="label label-danger"></span></br>
  Werke: i.d.R. liegen die Rechte bei den Künstlern
</dd>
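For what it's worth, in Python the heuristic would look roughly like this (a sketch only, not the actual import code; the tag handling is simplified):

import re


def license_value(line: str) -> str:
    """Prefer the text inside the <span class="label ..."> if there is one."""
    match = re.search(r"<span[^>]*>(.*?)</span>", line)
    value = match.group(1) if match else re.sub(r"<[^>]+>", "", line)
    return value.strip()


def split_licenses(license_lines: list) -> dict:
    """A line mentioning "meta" holds the metadata license, everything else
    counts as the media license; a single license is applied to both.
    The edge cases quoted above still need manual review."""
    meta = media = None
    for line in license_lines:
        value = license_value(line)
        if not value:
            continue
        if "meta" in line.lower():
            meta = value
        else:
            media = value
    return {"meta": meta or media, "media": media or meta}

On straightforward input like "Bilder: <span>CC-BY-SA</span>" / "Metadaten: <span>CC0</span>" this yields the media and metadata licenses directly; the mixed cases above would still come out wrong and need manual cleanup.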

I am going to implement a "legacy" field and dump all data set info into it in JSON format, apart from filling all regular fields as well as possible on a general level. :)

Snater commented 5 years ago

Here's an export example in JSON. The value of the key __raw_meta would end up unescaped in that legacy field. (I hope that all the general information – institution name, data set title, image and description – imports properly.) In the raw data I left the plain <span> tags in place, because they act as logical separators in some cases. In this example, the script detected the metadata and media licenses. However, inserting licenses will fail in quite a few cases. Sometimes no CC license version is named, like in this case; that cannot be imported with our current strict setup. Also, the version number is sometimes inside the <span> and sometimes outside it, with additional text on the same line that is a bit complex to parse. Apart from that, I think the result is quite OK. I also noticed another special case in this particular example: a link to a local file.

[{
  "name": "Botanischer Garten und Botanisches Museum Berlin",
  "dataSets": [{
    "__raw_meta": {
      "Lizenz": [
        "Bilder: <span>CC-BY-SA<\/span>",
        "Metadaten: <span>CC0<\/span>"
      ],
      "Dateityp": [
        "<span>JPEG<\/span><span>CSV<\/span><span>XML<\/span>"
      ],
      "Jahr": [
        "<span>Berlin 2017<\/span>"
      ],
      "Links": [
        "<a href=\"http:\/\/ww3.bgbm.org\/biocase\/downloads\/Schweinfurth\/Collection%20of%20botanical%20drawings%20by%20Georg%20Schweinfurth%20at%20B.DwCA.zip\">Metadaten<\/a>",
        "<a href=\"\/downloads\/daten-2017\/1245_P_BGBM_Schwenfurth_Zeichnungen.pdf\">Datenpr\u00e4sentation<\/a>"
      ]
    },
    "image": {
      "src": "{{ site.baseurl }}img\/daten\/bgbm-georg.jpg",
      "alt": "Georg Schweinfurths Sammlung botanischer Zeichnungen"
    },
    "title": "Georg Schweinfurths Sammlung botanischer Zeichnungen",
    "description": [
      "<p>Georg Schweinfurth, 1836\u20131925, war in erster Linie Botaniker und Pflanzengeograf, forschte und publizierte aber auch ausgiebig in Geografie, Ethnologie, Anthropologie und \u00c4gyptologie. Dabei hielt er seine Beobachtungen nicht nur in anschaulichen Texten fest, sondern vermochte als hervorragender Zeichner die Forschungsobjekte seiner ungew\u00f6hnlich breiten wissenschaftlichen Bet\u00e4tigung ebenso ansprechend wie instruktiv abzubilden. Schon zu Lebzeiten \u00fcberlie\u00df Schweinfurth dem Botanischen Garten und Botanischen Museum Berlin-Dahlem seine in vier Foliob\u00e4nden gebundene Sammlung botanischer Zeichnungen, die w\u00e4hrend des 2. Weltkrieges zusammen mit kostbaren alten Abbildungswerken der Bibliothek und besonders wertvollen Teilen des Herbars in Stollen der Kali-Werke Bleicherode-Ost bei Nordhausen ausgelagert wurden und so die Vernichtung von Herbar und Bibliothek durch Bombentreffer im M\u00e4rz 1943 \u00fcberstanden. In den vergangenen Jahren wurden die 624 erhaltenen Bl\u00e4tter, viele davon in fragilem Zustand, inventarisiert, digitalisiert, wissenschaftlich bearbeitet und im Web zug\u00e4nglich gemacht.<\/p>"
    ],
    "Lizenz": {
      "media": "CC-BY-SA",
      "meta": "CC0"
    },
    "Dateityp": [
      "JPEG",
      "CSV",
      "XML"
    ],
    "Jahr": [
      "Berlin 2017"
    ],
    "Links": [
      {
        "url": "http:\/\/ww3.bgbm.org\/biocase\/downloads\/Schweinfurth\/Collection%20of%20botanical%20drawings%20by%20Georg%20Schweinfurth%20at%20B.DwCA.zip",
        "title": "Metadaten"
      },
      {
        "url": "\/downloads\/daten-2017\/1245_P_BGBM_Schwenfurth_Zeichnungen.pdf",
        "title": "Datenpr\u00e4sentation"
      }
    ]
  }, <...>]
}, <...>]

Snater commented 5 years ago

I have created the import script for the projects. It seems like all information can be imported, except the references to the data sets used by each project. URL aliases are created as well.

As for the data sets, there is now a legacy field that receives all the metadata specified on the current page. This field is not accessible/visible to institution users.

Technically, I am done with this issue. I'd just need to run the import at a point in time still to be decided. The import should only be run once, because there is no tracking of imported entities: if the import fails or some information is missing, all imported entities would need to be deleted before rerunning it. Deleting is a quick task, but once everything is imported and the imported data sets have been edited, rerunning the import would mean all those changes are lost. So it is important to check right after importing that the results are not broken in any way.
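The check itself could be as simple as comparing the parsed source records against what can be read back out of Drupal, along these lines (purely illustrative; how the imported data is read back is not part of the sketch):

def find_import_gaps(source_records: list, imported_records: list) -> list:
    """Illustrative only: report data sets that went missing during the import
    or that lost fields which were present in the source."""
    imported_by_title = {record.get("title"): record for record in imported_records}
    problems = []
    for source in source_records:
        title = source.get("title")
        imported = imported_by_title.get(title)
        if imported is None:
            problems.append("missing: %s" % title)
            continue
        for field in ("description", "Links", "Lizenz"):
            if source.get(field) and not imported.get(field):
                problems.append("%s: field %s did not survive the import" % (title, field))
    return problems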

Running the import only once means that, as long as the current site is still online, data sets and projects would need to be maintained in both sites in parallel after the import has run. So the time between importing and putting the new site online should be kept as short as possible.

Snater commented 5 years ago

The general import is technically done, so I guess we can close this issue. If there are still problems resulting from the import or if there should be tickets for manually processing the imported data, we should open new issues.