citation-file-format / ruby-cff

A Ruby library for manipulating CITATION.cff files.
Apache License 2.0

Inconsistent validation between docker tool and GitHub #130

fmigneault closed this issue 1 week ago

fmigneault commented 2 weeks ago

Please note that issues with the validity of single CITATION.cff files may take some time to be picked up by the Citation File Format maintainers themselves. Therefore, if you are reading this issue and know how to validate CITATION.cff files, please help out if you can!

Invalid CITATION.cff file

The file can be found here: https://github.com/stac-extensions/mlm/blob/main/CITATION.cff

cff-version: 1.2.0
message: If you use this standard or software, please cite it using the metadata from this file.
title: Machine Learning Model Extension Specification for SpatioTemporal Asset Catalog
type: software
keywords:
  - mlm
  - Machine Learning
  - Model
  - STAC
url: "https://github.com/stac-extensions/mlm/blob/main/README.md"
repository-code: "https://github.com/stac-extensions/mlm"
license: Apache-2.0
license-url: https://github.com/stac-extensions/mlm/blob/main/LICENSE
identifiers:
  - type: doi
    value: "10.1145/3681769.3698586"
    description: "Conference paper presenting the standard."
  - type: url
    value: "https://stac-extensions.github.io/mlm/"
    description: "Generic URL of the MLM extension schema versions for 'stac_extensions' references."
contact:
  - given-names: Francis
    family-names: Charette-Migneault
    email: francis.charette-migneault@crim.ca
    affiliation: Computer Research Institute of Montréal (CRIM)
    orcid: "https://orcid.org/0000-0003-4862-3349"
  - given-names: Ryan
    family-names: Avery
    alias: rbavery
    email: ryan@wherobots.com
    affiliation: "Wherobots, Inc."
    orcid: "https://orcid.org/0000-0001-7392-1474"
authors: &authors
  - given-names: Francis
    family-names: Charette-Migneault
    alias: fmigneault
    email: francis.charette-migneault@crim.ca
    affiliation: Computer Research Institute of Montréal (CRIM)
    orcid: "https://orcid.org/0000-0003-4862-3349"
  - given-names: Ryan
    family-names: Avery
    alias: rbavery
    email: ryan@wherobots.com
    affiliation: "Wherobots, Inc."
    orcid: "https://orcid.org/0000-0001-7392-1474"
  - &crim
    name: Computer Research Institute of Montréal
    city: Montréal
    region: Québec
    alias: CRIM
    website: "https://www.crim.ca/"
    email: info@crim.ca
    tel: 1 (514) 840-1234
    country: CA
    post-code: H3N 1M3
    address: "101 – 405, avenue Ogilvy"
  - name: "Wherobots, Inc."
    address: 350 California St
    city: San Francisco
    country: US
    post-code: "94104"
    region: California
    website: "https://www.wherobots.ai/"
    location: Floor 1 - Lincoln Towne Center

references:
  - type: software-code
    title: "A PydanticV2 and PySTAC validation and serialization library for the STAC ML Model Extension"
    keywords:
      - stac_model
    repository-code: "https://github.com/stac-extensions/mlm/tree/main/stac_model"
    repository-artifact: "https://pypi.org/project/stac-model/"
    url: "https://github.com/stac-extensions/mlm/blob/main/README_STAC_MODEL.md"
    authors:
      - given-names: Ryan
        family-names: Avery
        alias: rbavery
        email: ryan@wherobots.com
        affiliation: "Wherobots, Inc."
        orcid: "https://orcid.org/0000-0001-7392-1474"
      - given-names: Francis
        family-names: Charette-Migneault
        alias: fmigneault
        email: francis.charette-migneault@crim.ca
        affiliation: Computer Research Institute of Montréal (CRIM)
        orcid: "https://orcid.org/0000-0003-4862-3349"

  - type: standard
    title: STAC MLM specification
    authors: *authors
    identifiers:
    - type: url
      value: "https://stac-extensions.github.io/mlm/v1.3.0/schema.json"
      description: "Latest extension URL used in 'stac_extensions' references."
    - type: url
      value: "https://stac-extensions.github.io/mlm/"
      description: "Generic URL of the MLM extension schema versions for 'stac_extensions' references."

  - type: software-code
    title: "Archive repository of the STAC MLM specification."
    repository-code: "https://github.com/crim-ca/mlm-extension"
    authors: *authors
    identifiers:
    - type: url
      value: "https://crim-ca.github.io/mlm-extension/v1.3.0/schema.json"
      description: "Archive extension URL used in 'stac_extensions' references."
    - type: url
      value: "https://crim-ca.github.io/mlm-extension/"
      description: "Generic URL of the archived MLM extension schema versions for 'stac_extensions' references."

  - type: report
    title: Project CCCOT03 – Technical Report
    abstract: "Project CCCOT03: Proposal for a STAC Extension for Deep Learning Models"
    keywords:
      - dlm
      - Deep Learning
      - Model
      - STAC
    repository: "https://raw.githubusercontent.com/crim-ca/CCCOT03/main/CCCOT03_Rapport%20Final_FINAL_EN.pdf"
    repository-code: "https://github.com/crim-ca/dlm-extension"
    license: Apache-2.0
    license-url: https://github.com/crim-ca/dlm-extension/blob/main/LICENSE
    date-released: "2020-12-14"
    languages:
      - en
    doi: "10.13140/RG.2.2.27858.68804"
    url: "https://www.researchgate.net/publication/349003427"
    institution: *crim
    authors:
      - given-names: Francis
        family-names: Charette-Migneault
        alias: fmigneault
        email: francis.charette-migneault@crim.ca
        affiliation: Computer Research Institute of Montréal (CRIM)
        orcid: "https://orcid.org/0000-0003-4862-3349"
      - given-names: Samuel
        family-names: Foucher
        alias: sfoucher
        orcid: "https://orcid.org/0000-0001-9557-6907"
      - given-names: David
        family-names: Landry
        orcid: "https://orcid.org/0000-0001-5343-2235"
      - given-names: Yves
        family-names: Moisan
        alias: ymoisan
      - name: Computer Research Institute of Montréal
        city: Montréal
        region: Québec
        alias: CRIM
        website: "https://www.crim.ca/"
        email: info@crim.ca
        tel: 1 (514) 840-1234
        country: CA
        post-code: H3N 1M3
        address: "101 – 405, avenue Ogilvy"
      - name: "Natural Resources Canada"
        country: CA
        website: "https://natural-resources.canada.ca/"
      - name: "Canada Centre for Mapping and Earth Observation"
        alias: CCMEO
        country: CA
        website: "https://natural-resources.canada.ca/research-centres-and-labs/canada-centre-for-mapping-and-earth-observation/25735"

  - type: conference
    notes: Conference reference where the demo paper presenting MLM is published.
    title: "GeoSearch’24: Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Searching and Mining Large Collections of Geospatial Data"
    conference:
      name: "SIGSPATIAL’24: The 32nd ACM International Conference on Advances in Geographic Information Systems"
      date-start: "2024-10-29"
      date-end: "2024-11-01"
      city: Atlanta
      region: Georgia
      country: US
    url: https://dl.acm.org/doi/proceedings/10.1145/3681769
    isbn: "979-8-4007-1148-0"
    date-published: "2024-10-29"
    publisher:
      name: "Association for Computing Machinery"
    authors:
      - given-names: Hao
        family-names: Li
      - given-names: Abhishek
        family-names: Potnis
      - given-names: Wenwen
        family-names: Li
      - given-names: Dalton
        family-names: Lunga
      - given-names: Martin
        family-names: Werner
      - given-names: Andreas
        family-names: Züfle

preferred-citation:
  type: conference-paper
  doi: "10.1145/3681769.3698586"
  title: Machine Learning Model Specification for Cataloging Spatio-Temporal Models
  conference:
    name: 3rd ACM SIGSPATIAL International Workshop on Searching and Mining Large Collections of Geospatial Data
    alias: GeoSearch’24
  date-published: "2024-10-29"
  year: 2024
  month: 10
  pages: 4
  loc-start: 36
  loc-end: 39
  location:
    name: Georgia Tech Hotel and Conference Center
    city: Atlanta
    region: Georgia
    country: US
  languages:
    - en
  abstract: >-
    The Machine Learning Model (MLM) extension is a
    specification that extends the SpatioTemporal Asset
    Catalogs (STAC) framework to catalog machine learning
    models. This demo paper introduces the goals of the MLM,
    highlighting its role in improving
    searchability and reproducibility of geospatial models.
    The MLM is contextualized within the STAC ecosystem,
    demonstrating its compatibility and the advantages it
    brings to discovering relevant geospatial models and
    describing their inference requirements.

    A detailed overview of the MLM's structure and fields
    describes the tasks, hardware requirements, frameworks,
    and inputs/outputs associated with machine learning
    models. Three use cases are presented, showcasing the
    application of the MLM in describing models for land cover
    classification and image segmentation. These examples
    illustrate how the MLM facilitates easier search and better
    understanding of how to deploy models in inference pipelines.

    The discussion addresses future challenges in extending
    the MLM to account for the diversity in machine learning
    models, including foundational and fine-tuned models,
    multi-modal models, and the importance of describing the
    data pipeline and infrastructure models depend on.
    Finally, the paper demonstrates the potential of the MLM
    to be a unifying standard to enable benchmarking and
    comparing geospatial machine learning models.
  keywords:
    - STAC
    - Catalog
    - Machine Learning
    - Spatio-Temporal Models
    - Search
  contact:
    - given-names: Francis
      family-names: Charette-Migneault
      email: francis.charette-migneault@crim.ca
      affiliation: Computer Research Institute of Montréal (CRIM)
      orcid: "https://orcid.org/0000-0003-4862-3349"
  authors:
    - given-names: Francis
      family-names: Charette-Migneault
      email: francis.charette-migneault@crim.ca
      affiliation: Computer Research Institute of Montréal (CRIM)
      orcid: "https://orcid.org/0000-0003-4862-3349"
    - given-names: Ryan
      family-names: Avery
      email: ryan@wherobots.com
      affiliation: "Wherobots, Inc."
      orcid: "https://orcid.org/0000-0001-7392-1474"
    - given-names: Brian
      family-names: Pondi
      email: brian.pondi@uni-muenster.de
      affiliation: "Institute for Geoinformatics, University of Münster"
      orcid: "https://orcid.org/0009-0008-0367-1690"
    - given-names: Joses
      family-names: Omojola
      affiliation: University of Arizona
      email: jomojo1@arizona.edu
      orcid: "https://orcid.org/0000-0001-5807-2953"
    - given-names: Simone
      family-names: Vaccari
      email: simone.vaccari@terradue.com
      affiliation: Terradue
      orcid: "https://orcid.org/0000-0002-2757-4165"
    - given-names: Parham
      family-names: Membari
      email: parham.membari@terradue.com
      affiliation: Terradue
      orcid: "https://orcid.org/0009-0004-7594-4011"
    - given-names: Devis
      family-names: Peressutti
      email: devis.peressutti@planet.com
      affiliation: "Sinergise Solutions, a Planet Labs company"
      orcid: "https://orcid.org/0000-0002-4660-0576"
    - given-names: Jia
      family-names: Yu
      email: jiayu@wherobots.com
      affiliation: "Wherobots, Inc."
      orcid: "https://orcid.org/0000-0003-1340-6475"
    - given-names: Jed
      family-names: Sundwall
      email: jed@radiant.earth
      affiliation: Radiant Earth
      orcid: "https://orcid.org/0000-0001-9681-230X"

Context

Running the validation tool

docker run --rm -v $(pwd)/CITATION.cff:/app/CITATION.cff citationcff/cffconvert --validate

I get NO error!

Yet, GitHub still indicates it cannot be parsed.

Hence my question: which one is correct, the official validation tool or GitHub? Thanks.

jspaaks commented 2 weeks ago

Hi @fmigneault, and thanks for opening the issue. I looked into it and it seems related to the use of YAML anchors. At least, when I replace the *authors and *crim aliases with the data they reference, the file seems to work correctly.
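To illustrate why that replacement is a safe workaround, here is a minimal Ruby sketch. It assumes Ruby's bundled Psych YAML library (a recent version that accepts the `aliases:` keyword), and the two snippets are shortened, hypothetical stand-ins for the real file. It checks that an aliased list and its inlined copy parse to the same data:

```ruby
require "yaml"

# Hypothetical, shortened stand-in for the file above: `authors` is
# anchored once and reused through an alias inside a reference.
WITH_ALIAS = <<~CFF
  authors: &authors
    - given-names: Francis
      family-names: Charette-Migneault
  references:
    - type: standard
      title: STAC MLM specification
      authors: *authors
CFF

# The same content with the alias replaced by the data it points to.
EXPANDED = <<~CFF
  authors:
    - given-names: Francis
      family-names: Charette-Migneault
  references:
    - type: standard
      title: STAC MLM specification
      authors:
        - given-names: Francis
          family-names: Charette-Migneault
CFF

# Aliases must be enabled explicitly for the first document.
aliased  = YAML.safe_load(WITH_ALIAS, aliases: true)
expanded = YAML.safe_load(EXPANDED)

# Both forms describe the same data structure.
puts aliased == expanded
```

Since both forms carry identical content, inlining the data only works around the parser limitation; it does not change what the file says.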

I believe this could be a bug in ruby-cff, the library that GitHub uses to render CITATION.cff files on their website. Let me ping @hainesr for you. Rob created and maintains said library. I'll also move this issue there.

Hope this helps!

hainesr commented 2 weeks ago

Hello,

Yes, it's the anchors. It turns out that the underlying Ruby library that handles YAML doesn't process anchors by default, perhaps as a security precaution.
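As a rough illustration of that behavior (not the actual change made in ruby-cff), the sketch below assumes Ruby's bundled Psych library and a recent version that accepts the `aliases:` keyword. The document and the helper names `load_strict` and `load_with_aliases` are hypothetical:

```ruby
require "yaml"

# Hypothetical, shortened document that uses an anchor and an alias.
DOC = <<~CFF
  authors: &authors
    - name: Example Org
  references:
    - type: standard
      authors: *authors
CFF

# Safe loading rejects aliases unless they are explicitly enabled.
# The concrete error class varies across Psych versions, so rescue
# the common parent class.
def load_strict(text)
  YAML.safe_load(text)
rescue Psych::BadAlias => e
  "rejected: #{e.class}"
end

# Opting in to aliases lets the same document load normally.
def load_with_aliases(text)
  YAML.safe_load(text, aliases: true)
end

p load_strict(DOC)
p load_with_aliases(DOC)
```

With aliases enabled, the `authors` entry under `references` resolves to the same list as the top-level `authors` key.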

I will enable this and see how it goes. I'll cut a new release and notify GitHub if all is well.

hainesr commented 1 week ago

I have fixed this and pushed version 1.3.0 of the gem.

@arfon, please can you let the relevant people at GitHub know that there's a new release of the Ruby CFF gem? Thanks!

arfon commented 1 week ago

Thanks for the ping @hainesr. I've opened a PR on github/github to update this but it likely won't land until the end of the week as there's a deploy block for a few days during GitHub Universe.

hainesr commented 1 week ago

Great, thanks @arfon.

@fmigneault please can I suggest you try your file on GitHub again in a couple of weeks?

arfon commented 1 week ago

@hainesr @fmigneault – the latest version of ruby-cff is now live on GitHub.com. Could you confirm your issue is now addressed?

fmigneault commented 1 week ago

Yes. The file is correctly parsed and the citation option is rendered on the main repo page. Thanks for the quick fix.

hainesr commented 1 week ago

Amazing 🚀

Thanks @arfon for the quick turnaround and thanks @fmigneault for testing.