describo / crate-builder-component

A VueJS UI component to build an RO-Crate
MIT License

Entity Identifier that uses arcp protocol is not validated properly #33

Closed alvinsw closed 1 year ago

alvinsw commented 1 year ago

When a crate has an entity that uses the arcp protocol in its id, the id fails the check done with the validator library, which causes the component to create a whole bunch of new objects with ids prefixed by #. The main issue is with the validator library:

import { isURL as validatorIsURL } from "validator";
const urlProtocols = ["http", "https", "ftp", "ftps", "arcp"];

// this returns false
validatorIsURL('arcp://name,cooee-corpus/item/2-157', { require_protocol: true, protocols: urlProtocols });

// this returns true, so the isURL function only works for a strict URL, not a URI such as arcp
validatorIsURL('arcp://namecooee-corpus.com/item/2-157', { require_protocol: true, protocols: urlProtocols });
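
One way to accept URI-style identifiers such as arcp is to check only the scheme instead of validating the whole string as a URL. The following is a minimal sketch, not the fix that was applied in the component: `hasAllowedScheme` and `allowedSchemes` are hypothetical names, and the check deliberately ignores whether the host part is a valid domain name.

```javascript
// Hypothetical helper, not part of the validator library or this component.
// It checks only the URI scheme, so hosts that are not valid domain names
// (such as "name,cooee-corpus") are accepted.
const allowedSchemes = ["http", "https", "ftp", "ftps", "arcp"];

function hasAllowedScheme(id, schemes = allowedSchemes) {
  if (typeof id !== "string") return false;
  // Scheme grammar from RFC 3986 (a letter, then letters, digits, "+", "-" or "."),
  // followed here by "://".
  const match = /^([a-z][a-z0-9+.-]*):\/\//i.exec(id);
  return match !== null && schemes.includes(match[1].toLowerCase());
}

console.log(hasAllowedScheme("arcp://name,cooee-corpus/item/2-157")); // true
console.log(hasAllowedScheme("name,cooee-corpus/item/2-157")); // false
```

A check like this trades strictness for coverage: it would also accept malformed http URLs, so it may be better used only as a fallback for schemes that a strict URL validator rejects.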

If the id fails the validation check, it is replaced in this line: entity["@id"] = `#${entity["@id"]}`;. That leads to a second, bigger problem, especially when the data is large: a lot of new entities are created, which makes loading take forever. I think that before a new entity is created and pushed, the code should first check whether one with that id already exists.
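
The "check before pushing" idea could look roughly like the sketch below. It assumes entities are kept in a Map keyed by "@id"; `addEntityOnce` and `entitiesById` are hypothetical names, and the component's real data structures may differ.

```javascript
// Hypothetical sketch: index entities by "@id" so that an entity is
// only added when no entity with that id exists yet.
function addEntityOnce(entitiesById, entity) {
  const existing = entitiesById.get(entity["@id"]);
  if (existing !== undefined) return existing; // reuse, do not add a duplicate
  entitiesById.set(entity["@id"], entity);
  return entity;
}

const entitiesById = new Map();
const first = addEntityOnce(entitiesById, { "@id": "#item/1-001", "@type": "Thing" });
const second = addEntityOnce(entitiesById, { "@id": "#item/1-001", "@type": "Thing" });
console.log(first === second, entitiesById.size); // true 1
```

A Map lookup is constant time on average, so the check stays cheap even for crates with many thousands of entities, whereas scanning an array for each new entity would not.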

Test data is attached here as zip file: ro-crate-metadata.zip

marcolarosa commented 1 year ago

Can you please provide a very basic crate with a basic root dataset and one entity using an arcp identifier for testing.

alvinsw commented 1 year ago

An example of a very basic crate:

{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {
      "@type": "CreativeWork",
      "@id": "ro-crate-metadata.json",
      "identifier": "ro-crate-metadata.json",
      "about": {
        "@id": "arcp://name,cooee-corpus/corpus/root"
      },
      "conformsTo": {
        "@id": "https://w3id.org/ro/crate/1.1"
      }
    },
    {
      "@id": "arcp://name,cooee-corpus/corpus/root",
      "@type": "Dataset",
      "@reverse": {},
      "name": "Test",
      "hasMember": [
        {
          "@id": "arcp://name,cooee-corpus/item/1-001"
        }
      ]
    },
    {
      "@id": "arcp://name,cooee-corpus/item/1-001",
      "@type": "RepositoryObject",
      "conformsTo": {
        "@id": "https://purl.archive.org/language-data-commons/profile#Object"
      },
      "name": "Text 1-001 1788 Phillip, Arthur"
    }
  ]
}

You can see that arcp://name,cooee-corpus/item/1-001 becomes a Thing and # is added as a prefix in the editor. To see the performance issue, please use the previously attached zip file as the test data.

marcolarosa commented 1 year ago

There are two issues here.

  1. The code wasn't handling arcp identifiers correctly. That is now fixed.

  2. The slowness was not due to the arcp bug but to the massive entity lists on the properties (e.g. hasPart). The browser was being crushed trying to render all those DOM elements. The only fix I can think of, which is in this commit, is to paginate those massive lists. So the code now has a default page size of 50 elements and a filter box to filter the list. I figure most people don't need to see the whole list at once, and pagination deals with that. Most people are probably looking for something in the list, and filtering solves that.

I tried setting a large page size but when you have a few properties with large arrays on them then that adds up quite significantly in terms of browser load. 50 per page seems to be a reasonable default for now.
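
For readers who want a picture of the approach, here is a minimal sketch of filtering followed by pagination. It is not the component's actual code: `filterEntities`, `getPage` and `DEFAULT_PAGE_SIZE` are hypothetical names, and matching on "@id" and name is an assumption. Only the page size of 50 comes from the comment above.

```javascript
// Hypothetical helpers; the names and the "@id"/name matching are assumptions.
const DEFAULT_PAGE_SIZE = 50;

// Keep only entities whose "@id" or name contains the query (case-insensitive).
function filterEntities(entities, query) {
  const needle = query.trim().toLowerCase();
  if (needle === "") return entities;
  return entities.filter((e) =>
    [e["@id"], e.name].some(
      (v) => typeof v === "string" && v.toLowerCase().includes(needle)
    )
  );
}

// Return one page of entities; pages are 1-based.
function getPage(entities, page, pageSize = DEFAULT_PAGE_SIZE) {
  const start = (page - 1) * pageSize;
  return entities.slice(start, start + pageSize);
}

const members = Array.from({ length: 120 }, (_, i) => ({
  "@id": `arcp://name,cooee-corpus/item/${i + 1}`,
  name: `Text ${i + 1}`,
}));

console.log(getPage(members, 1).length); // 50
console.log(getPage(members, 3).length); // 20
console.log(filterEntities(members, "Text 12").length); // 2 ("Text 12" and "Text 120")
```

Filtering before paginating means only the current page of matches needs to be rendered, which keeps the number of DOM elements bounded by the page size regardless of how long the underlying array is.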

I've just built v0.23.0 with this code.

marcolarosa commented 1 year ago

Just built 0.23.1 to fix a couple of issues in the new code.

alvinsw commented 1 year ago

Thanks @marcolarosa, I can confirm that the arcp bug is fixed. I think this issue can be closed.