JaneliaSciComp / dis-utilities

Utilities for Data and Information Services
BSD 3-Clause "New" or "Revised" License

Review Rob's documentation #80

Open virginiascarlett opened 2 weeks ago

virginiascarlett commented 2 weeks ago

In particular, make sure the section on data sources is comprehensive. Also get some installation instructions up on GitHub.

virginiascarlett commented 6 days ago

Some suggestions for the homepage README:

This software library automatically discovers papers and datasets published by HHMI Janelia scientists and stores them in a MongoDB database. The automated scripts, which are run on a weekly basis, also make "educated guesses" about metadata that are of strategic interest to Janelia, such as labs, teams, and employees who contributed to the work. Utility scripts allow the librarian or database administrator to curate these metadata in a semi-automated fashion. A Flask-based application provides a user interface, visualizations, and a REST API.

The two most important collections in the DIS database are the dois collection and the orcid collection. Documents in the dois collection adhere to either the Crossref schema or DataCite schema, depending on which organization issued the DOI. We also add some new metadata fields to these DOI schemas. Documents in the orcid collection follow a custom schema. Each document in the orcid collection represents a Janelia employee, and contains either an ORCID, an employee ID, or both.

utility: utility programs meant to be run interactively on the command line, for querying and manipulating database records

Common command line parameters:
--doi: a single DOI to process
--file: a file of DOIs to process (one DOI per line)
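As a rough illustration of how these shared parameters could be handled, here is a minimal argparse sketch. It is not the actual code in utility/: the function names build_parser and collect_dois are hypothetical, and accepting --doi and --file together (rather than making them mutually exclusive) is an assumption.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the common parameters; not the real one."""
    parser = argparse.ArgumentParser(description="Sketch of shared DOI options")
    parser.add_argument("--doi", help="a single DOI to process")
    parser.add_argument("--file", help="a file of DOIs to process (one DOI per line)")
    return parser


def collect_dois(args: argparse.Namespace) -> list:
    """Return the DOIs named by --doi and/or --file, skipping blank lines."""
    dois = []
    if args.doi:
        dois.append(args.doi.strip())
    if args.file:
        # One DOI per line, as described for --file.
        for line in Path(args.file).read_text(encoding="utf-8").splitlines():
            if line.strip():
                dois.append(line.strip())
    return dois
```

Calling collect_dois(build_parser().parse_args()) at the top of a script would give the list of DOIs to loop over.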

virginiascarlett commented 3 days ago

Comments on documentation in sync/:

virginiascarlett commented 3 days ago

For documentation in utility/:

Name Description
get_citation.py print citations to the terminal in Janelia newsletter format
name_match.py interactively curate the list of Janelia authors (jrc_author) for one or more DOIs
weekly_pubs.py a wrapper for the whole weekly curation pipeline
set_alumni.py add an alumni tag to a Janelian's metadata
add_preprint.py add preprint relationship for a particular preprint-article pair
add_newsletter.py add, change, or remove a date in a DOI's jrc_newsletter field

weekly_pubs.py

A wrapper script for the whole weekly curation pipeline. For a DOI or batch of DOIs, it runs update_dois.py, name_match.py, update_tags.py, and get_citation.py, in that order. It will not re-add DOIs that are already in the database, but it will still run the rest of the scripts on them.

Example usage: python3 weekly_pubs.py --doi 10.1038/s41586-024-07939-3 --write --verbose

If you want to simply add a DOI to the database without running the rest of the pipeline, run with the --sync_only flag. As always, you must add the --write flag for the change to persist in the database. For example: python3 weekly_pubs.py --doi 10.1038/s41586-024-07939-3 --write --sync_only

It is better to add DOIs to the database this way, rather than running sync/bin/update_dois.py directly, because this script performs a couple of additional quality checks on the DOIs.
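To make the order of operations concrete, here is a small sketch of which scripts would run for a given DOI. The script names and their order come from the description above; the function planned_steps and its parameters are hypothetical, and skipping update_dois.py entirely for DOIs already in the database is an assumption about how "will not add" is implemented.

```python
# Script order taken from the weekly_pubs.py description.
PIPELINE = ["update_dois.py", "name_match.py", "update_tags.py", "get_citation.py"]


def planned_steps(already_in_db: bool, sync_only: bool = False) -> list:
    """Return the scripts that would run for one DOI (illustrative only)."""
    steps = []
    if not already_in_db:
        # New DOIs are added to the database first.
        steps.append(PIPELINE[0])
    if not sync_only:
        # Without --sync_only, the rest of the pipeline runs, in order.
        steps.extend(PIPELINE[1:])
    return steps
```

With sync_only=True, a new DOI only goes through the first step, which matches the --sync_only behavior described above.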

add_newsletter.py

Add, change, or remove a date in a DOI's jrc_newsletter field. Importantly, only papers that have a jrc_newsletter field will go on janelia.org. Also, these papers' jrc_author field won't be automatically updated (though it can still be manually updated).

Usually, you can add a newsletter date during the weekly curation process, when update_tags.py prompts you to set the newsletter date to today. Sometimes, though, you'll need to set a newsletter date to a date that's not today. Example usage:
python3 add_newsletter.py --doi 10.1038/s41592-024-02411-6 --date 2024-09-27 --write
python3 add_newsletter.py --doi 10.1038/s41592-024-02411-6 --remove --write

The actual date itself isn't used in our automated systems, so it won't be catastrophic if you set it to a silly date. However, it's nice to set jrc_newsletter to the same date for all the papers that went into a particular newsletter issue.
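One reason a consistent date is handy: it lets you pull up everything that went into a given issue. The sketch below groups DOI records by jrc_newsletter. It works on plain dictionaries rather than the real database, and it assumes the date is stored as a YYYY-MM-DD string (the format accepted by --date) and that each record has a "doi" key; both are assumptions, as is the function name group_by_issue.

```python
from collections import defaultdict
from datetime import date


def group_by_issue(docs: list) -> dict:
    """Map each newsletter date to the DOIs that carry it (illustrative only)."""
    issues = defaultdict(list)
    for doc in docs:
        stamp = doc.get("jrc_newsletter")
        if stamp is None:
            # No jrc_newsletter field: the paper was never put in a newsletter.
            continue
        issues[date.fromisoformat(stamp)].append(doc["doi"])
    return dict(issues)
```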

add_preprint.py

Add a preprint relationship for a particular preprint-article pair. It's not unusual for a preprint relationship to be missed both by Crossref and by the DIS system's "educated guessing". I always check Google for preprints before putting a journal article into the newsletter. If you discover a preprint or preprint relation that's not in our system, use this script to add the relationship and/or the preprint DOI. The relationship is stored in the jrc_preprint field, which is simply an array of DOIs. For a journal article, jrc_preprint will contain the preprint DOI(s), and for a preprint, it will contain the journal article DOI(s). Example usage:
python3 add_preprint.py --journal 10.1038/s41592-024-02411-6 --preprint 10.1101/2023.07.18.549527 --write

If the preprint DOI is not in the database, the script will prompt you to add it. This script cannot be used to remove preprint relationships.
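The two-way nature of jrc_preprint can be sketched as follows. This is not the code in add_preprint.py; it uses an in-memory dictionary keyed by DOI as a stand-in for the dois collection, and the function name link_preprint is hypothetical.

```python
def link_preprint(dois: dict, journal: str, preprint: str) -> None:
    """Record the relationship on both records, without duplicating entries."""
    for doi, partner in ((journal, preprint), (preprint, journal)):
        # jrc_preprint is an array of DOIs; create it if it is missing.
        related = dois[doi].setdefault("jrc_preprint", [])
        if partner not in related:
            related.append(partner)
```

After link_preprint runs, the journal article's jrc_preprint lists the preprint DOI and the preprint's jrc_preprint lists the journal article DOI, and running it a second time changes nothing.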

set_alumni.py

Add an alumni tag to, or remove an alumni tag from, a Janelian's record in the orcid collection. Alumni are not included in jrc_author, so if an alum has a profile on janelia.org, the paper won't be added to that profile. The alumni field is automatically created and set to true when an employeeId that we have in the orcid collection is no longer in the People system. Example usage:
python3 set_alumni.py --orcid 0000-1111-2222-3333 --write
python3 set_alumni.py --employee J0123 --write
python3 set_alumni.py --orcid 0000-1111-2222-3333 --write --unset  # remove the alumni field
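The automatic rule for the alumni field can be sketched like this. It is an illustration on plain dictionaries, not the sync code itself: the function name mark_alumni is hypothetical, and active_ids stands in for whatever list of current employee IDs the People system returns.

```python
def mark_alumni(records: list, active_ids: set) -> list:
    """Return copies of orcid records, flagging those no longer in People."""
    updated = []
    for record in records:
        record = dict(record)  # copy, so the caller's data is left untouched
        employee_id = record.get("employeeId")
        # Records with only an ORCID have no employeeId to check.
        if employee_id is not None and employee_id not in active_ids:
            record["alumni"] = True
        updated.append(record)
    return updated
```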

name_match.py

Interactively curate the list of Janelia authors for one or more DOIs. This list is stored in the DOI metadata under the jrc_author field. Because of the way the database is set up, the list of Janelia authors on the browser interface will not reflect your changes to jrc_author. Rest assured, though, your changes will be stored in the database, as long as you use the --write flag. The new Janelia.org uses jrc_author to determine Janelia authors for a paper. Example usage: python3 name_match.py --doi 10.1038/s41586-024-07939-3 --write --verbose

update_tags.py

Modify the tags on (and optionally add a newsletter date to) one or more DOIs. 'Tags' is our jargon for labels representing labs, project teams, or support teams. Tags are derived from the Janelia authors' HHMI People profiles. They include HHMI supervisory organization codes ('supOrg codes'), as well as supOrg names and cost center descriptions.

It is EXTREMELY IMPORTANT that you tag each DOI with ALL applicable tags. So, for example, if you encounter a DOI with possible tags "Srinivas Turaga Lab" and "Srini Turaga Lab", select both.

If a postdoc or research assistant is an author but their group leader is not, do not tag it with that lab's tag(s).

Example usage: python3 update_tags.py --doi 10.1101/2023.07.18.549527 --write

get_citation.py

Print one or more article citations in the Janelia newsletter format. The DOIs must be in the database already. Typical usage:
python3 get_citation.py --doi 10.1038/s41592-024-02411-6
Output looks like this:

Farrants, H, Shuai, Y, Lemon, WC, Monroy Hernandez, C, Zhang, D, Yang, S, Patel, R, Qiao, G, Frei, MS, Plutkis, SE, Grimm, JB, Hanson, TL, Tomaska, F, Turner, GC, Stringer, C, Keller, PJ, Beyene, AG, Chen, Y, Liang, Y, Lavis, LD, Schreiter, ER. A modular chemigenetic calcium indicator for multiplexed in vivo functional imaging. https://doi.org/10.1038/s41592-024-02411-6.
Preprint: https://doi.org/10.1101/2023.07.18.549527

Sometimes, you'll want a citation for a DOI that can't be added to the database because it's not in Crossref. (This can happen with bioRxiv.) In these cases, you can add the DOI to EndNote, export the citation in the "Janelia Science News" format, and feed the resulting text file to this script to produce a usable citation. (EndNote won't let you export citations without journal names.) Example: Here's a file called endnote.txt, exported from EndNote:

Bulumulla, C, Walpita, D, Iyer, N, Eddison, M, Patel, R, Alcor, D, Ackerman, D, Beyene, AG. Synaptic specializations at dopamine release sites orchestrate efficient and precise neuromodulatory signaling. bioRxiv. 2024:2024.09.16.613338. http://dx.doi.org/10.1101/2024.09.16.613338.

Run the script like so:
python3 get_citation.py --endnote endnote.txt
to produce:

Bulumulla, C, Walpita, D, Iyer, N, Eddison, M, Patel, R, Alcor, D, Ackerman, D, Beyene, AG. Synaptic specializations at dopamine release sites orchestrate efficient and precise neuromodulatory signaling. https://doi.org/10.1101/2024.09.16.613338.
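For readers curious what the --endnote conversion involves, here is a minimal sketch that reproduces the endnote.txt example above. It is not the logic in get_citation.py: the function name endnote_to_newsletter is hypothetical, and it assumes that the authors and the title are the first two ". "-separated fields of the EndNote line, which will break on titles that contain a full stop followed by a space.

```python
import re

# Matches both http://dx.doi.org/... and https://doi.org/... URLs.
DOI_URL = re.compile(r"https?://(?:dx\.)?doi\.org/(\S+)")


def endnote_to_newsletter(line: str) -> str:
    """Drop the journal and volume fields and normalize the DOI URL."""
    match = DOI_URL.search(line)
    if match is None:
        raise ValueError("no DOI URL found in EndNote line")
    doi = match.group(1).rstrip(".")  # strip the sentence-final full stop
    fields = line[: match.start()].split(". ")
    if len(fields) < 2:
        raise ValueError("expected at least an author field and a title field")
    authors, title = fields[0], fields[1]
    return f"{authors}. {title}. https://doi.org/{doi}."
```

The journal name and volume string are discarded because the newsletter format shown above does not include them.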