Open virginiascarlett opened 1 month ago
Some suggestions for the homepage README:
This software library automatically discovers papers and datasets published by HHMI Janelia scientists and stores them in a MongoDB database. The automated scripts, which are run on a weekly basis, also make "educated guesses" about metadata that are of strategic interest to Janelia, such as labs, teams, and employees who contributed to the work. Utility scripts allow the librarian or database administrator to curate these metadata in a semi-automated fashion. A Flask-based application provides a user interface, visualizations, and a REST API.
The two most important collections in the DIS database are the dois collection and the orcid collection. Documents in the dois collection adhere to either the Crossref schema or the DataCite schema, depending on which organization issued the DOI. We also add some new metadata fields to these DOI schemas. Documents in the orcid collection follow a custom schema. Each document in the orcid collection represents a Janelia employee and contains an ORCID, an employee ID, or both.
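As a rough illustration of the "ORCID, employee ID, or both" rule, here is a minimal sketch that checks which identifiers a document carries. The field name orcid is an assumption made for illustration (employeeId is the name used later in this issue), and the sample documents are invented; the real custom schema may differ.

```python
def identifiers_present(doc):
    """Return the identifier fields that are set on an orcid-collection document.

    Assumption: identifiers live in fields named "orcid" and "employeeId".
    """
    found = []
    if doc.get("orcid"):
        found.append("orcid")
    if doc.get("employeeId"):
        found.append("employeeId")
    return found


# Hypothetical documents; every valid document should yield at least one identifier.
both = {"orcid": "0000-1111-2222-3333", "employeeId": "J0123"}
orcid_only = {"orcid": "0000-1111-2222-3333"}
print(identifiers_present(both))
print(identifiers_present(orcid_only))
```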
utility: utility programs meant to be run interactively on the command line, for querying and manipulating database records
Common command line parameters:
--doi: a single DOI to process
--file: a file of DOIs to process (one DOI per line)
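For readers who want to see how these two parameters could combine into one list of DOIs, here is a minimal argparse sketch. The helper names build_parser and collect_dois are hypothetical, and the actual scripts may parse their arguments differently (for example, they may treat the two options as mutually exclusive).

```python
import argparse


def build_parser():
    """Build a parser with the two shared DOI options described above."""
    parser = argparse.ArgumentParser(description="Sketch of the shared DOI options")
    parser.add_argument("--doi", help="a single DOI to process")
    parser.add_argument("--file", help="a file of DOIs to process (one DOI per line)")
    return parser


def collect_dois(args):
    """Return the DOIs named by --doi and/or --file, skipping blank lines."""
    dois = []
    if args.doi:
        dois.append(args.doi.strip())
    if args.file:
        with open(args.file, encoding="utf-8") as handle:
            dois.extend(line.strip() for line in handle if line.strip())
    return dois
```

A script built this way would call collect_dois(build_parser().parse_args()) once at startup and then loop over the result.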
Comments on documentation in sync/:
Change references to "OA" to "OA.Works" ("OA" is just scholarly publishing jargon for "open access")
In the chart at the top of the page, update_dois.py and update_orcid.py descriptions: it would be nice if each description more clearly indicated that running the script adds new records to the DB.
Maybe instead of the "Running in production" sections sprinkled throughout the document, just add another column to the table for "Run frequency". And note somewhere at the top that all the timed jobs are run on Jenkins.
For documentation in utility/:
Name | Description |
---|---|
get_citation.py | print citations to the terminal in Janelia newsletter format |
name_match.py | interactively curate the list of Janelia authors (jrc_author) for one or more DOIs |
weekly_pubs.py | a wrapper for the whole weekly curation pipeline |
set_alumni.py | add an alumni tag to a Janelian's metadata |
add_preprint.py | add preprint relationship for a particular preprint-article pair |
add_newsletter.py | add, change, or remove a date in a DOI's jrc_newsletter field |
A wrapper script for the whole weekly curation pipeline. For a DOI or batch of DOIs, it runs update_dois.py, name_match.py, update_tags.py, and get_citation.py, in that order. It will not add DOIs to the database that are already in the database, but it will run the rest of the scripts on those DOIs.
Example usage:
python3 weekly_pubs.py --doi 10.1038/s41586-024-07939-3 --write --verbose
If you want to simply add a DOI to the database without running the rest of the pipeline, run with the --sync_only flag. As always, you must add the --write flag for the change to persist in the database.
For example:
python3 weekly_pubs.py --doi 10.1038/s41586-024-07939-3 --write --sync_only
It is better to add DOIs to the database this way, rather than running sync/bin/update_dois.py directly, because this script performs a couple of additional quality checks on the DOIs.
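The control flow described for weekly_pubs.py can be sketched as follows. This is only an illustration of the ordering and skipping rules stated above, not the script's actual code; plan_steps and its parameters are hypothetical names, and the set of DOIs stands in for a database lookup.

```python
# Order of the weekly curation pipeline, as described above.
PIPELINE = ["update_dois.py", "name_match.py", "update_tags.py", "get_citation.py"]


def plan_steps(doi, dois_in_db, sync_only=False):
    """Return the scripts that would run for one DOI.

    dois_in_db is a set of DOIs already stored (a stand-in for a database query).
    """
    steps = []
    if doi not in dois_in_db:
        # Only DOIs that are new to the database get synced.
        steps.append(PIPELINE[0])
    if not sync_only:
        # The remaining scripts run whether or not the DOI was already present.
        steps.extend(PIPELINE[1:])
    return steps


print(plan_steps("10.1038/s41586-024-07939-3", set()))
print(plan_steps("10.1038/s41586-024-07939-3", set(), sync_only=True))
```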
Add, change, or remove a date in a DOI's jrc_newsletter field. Importantly, only papers that have a jrc_newsletter field will go on janelia.org. Also, these papers' jrc_author field won't be automatically updated (though it can still be manually updated).
Usually, you can add a newsletter date during the weekly curation process, when update_tags.py prompts you to set the newsletter date to today. Sometimes, though, you'll need to set a newsletter date to a date that's not today.
Example usage:
python3 add_newsletter.py --doi 10.1038/s41592-024-02411-6 --date 2024-09-27 --write
python3 add_newsletter.py --doi 10.1038/s41592-024-02411-6 --remove --write
The actual date itself isn't used in our automated systems, so it won't be catastrophic if you set it to a silly date. However, it's nice to set jrc_newsletter to the same date for all the papers that went into a particular newsletter issue.
Add a preprint relationship for a particular preprint-article pair. It's not unusual that a preprint relationship is missed both by Crossref and by the DIS system's "educated guessing". I always check Google for preprints before putting a journal article into the newsletter. If you discover a preprint or preprint relation that's not in our system, use this script to add the relationship and/or the preprint DOI. This is stored in the jrc_preprint field, which is simply an array of DOIs. For a journal article, jrc_preprint will contain the preprint DOI(s), and for a preprint, it will contain the journal article DOI(s).
Example usage:
python3 add_preprint.py --journal 10.1038/s41592-024-02411-6 --preprint 10.1101/2023.07.18.549527 --write
If the preprint DOI is not in the database, the script will prompt you to add it. This script cannot be used to remove preprint relationships.
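To make the two-way nature of jrc_preprint concrete, here is a small sketch that links a journal article and a preprint in both directions. It uses a plain dict keyed by DOI as a stand-in for the dois collection; link_preprint is a hypothetical helper, not the code in add_preprint.py.

```python
def link_preprint(records, journal_doi, preprint_doi):
    """Record the relationship on both documents without creating duplicates.

    records maps DOI -> metadata dict (a stand-in for the dois collection).
    Both DOIs must already be present; the real script prompts to add a
    missing preprint instead of raising KeyError.
    """
    pairs = ((journal_doi, preprint_doi), (preprint_doi, journal_doi))
    for doi, other in pairs:
        related = records[doi].setdefault("jrc_preprint", [])
        if other not in related:
            related.append(other)
    return records


records = {"10.1038/s41592-024-02411-6": {}, "10.1101/2023.07.18.549527": {}}
link_preprint(records, "10.1038/s41592-024-02411-6", "10.1101/2023.07.18.549527")
print(records)
```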
Add an alumni tag to, or remove an alumni tag from, a Janelian's record in the orcid collection. Alumni are not included in jrc_author; therefore, if an alumnus has a profile on janelia.org, the paper won't be added to that profile. The alumni field is automatically created and set to true when an employeeId that we have in the orcid collection is no longer in the People system.
Example usage:
python3 set_alumni.py --orcid 0000-1111-2222-3333 --write
python3 set_alumni.py --employee J0123 --write
python3 set_alumni.py --orcid 0000-1111-2222-3333 --write --unset #remove the alumni field
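The automatic rule described above (an employeeId in the orcid collection that is no longer in the People system gets alumni set to true) could look roughly like this. The function name and the in-memory inputs are assumptions for illustration; the real sync job queries the People system and MongoDB.

```python
def mark_alumni(orcid_docs, current_employee_ids):
    """Set alumni=True on documents whose employeeId is no longer current.

    orcid_docs: list of dicts standing in for orcid-collection documents.
    current_employee_ids: set of IDs currently found in the People system.
    Documents without an employeeId are left untouched.
    """
    for doc in orcid_docs:
        employee_id = doc.get("employeeId")
        if employee_id and employee_id not in current_employee_ids:
            doc["alumni"] = True
    return orcid_docs


docs = [{"employeeId": "J0123"}, {"employeeId": "J0456"}, {"orcid": "0000-1111-2222-3333"}]
print(mark_alumni(docs, {"J0456"}))
```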
Interactively curate the list of Janelia authors for one or more DOIs. This list is stored in the DOI metadata under the jrc_author field. Because of the way the database is set up, the list of Janelia authors on the browser interface will not reflect your changes to jrc_author. Rest assured, though, your changes will be stored in the database, as long as you use the --write flag. The new Janelia.org uses jrc_author to determine Janelia authors for a paper.
Example usage:
python3 name_match.py --doi 10.1038/s41586-024-07939-3 --write --verbose
Modify tags (and optionally add a newsletter date) for one or more DOIs. 'Tags' is our jargon for labels representing labs, project teams, or support teams. Tags are derived from the Janelia authors' HHMI People profiles. They include HHMI supervisory organization codes ('supOrg codes'), as well as supOrg names and cost center descriptions.
It is EXTREMELY IMPORTANT that you tag each DOI with ALL applicable tags. So, for example, if you encounter a DOI with possible tags "Srinivas Turaga Lab" and "Srini Turaga Lab", select both.
If a postdoc or research assistant is an author but their group leader is not, do not tag it with that lab's tag(s).
Example usage:
python3 update_tags.py --doi 10.1101/2023.07.18.549527 --write
Print one or more article citations in the Janelia newsletter format. The DOIs must be in the database already. Typical usage:
python3 get_citation.py --doi 10.1038/s41592-024-02411-6
Output looks like this:
Farrants, H, Shuai, Y, Lemon, WC, Monroy Hernandez, C, Zhang, D, Yang, S, Patel, R, Qiao, G, Frei, MS, Plutkis, SE, Grimm, JB, Hanson, TL, Tomaska, F, Turner, GC, Stringer, C, Keller, PJ, Beyene, AG, Chen, Y, Liang, Y, Lavis, LD, Schreiter, ER. A modular chemigenetic calcium indicator for multiplexed in vivo functional imaging. https://doi.org/10.1038/s41592-024-02411-6.
Preprint: https://doi.org/10.1101/2023.07.18.549527
Sometimes, you'll want a citation for a DOI that can't be added to the database because it's not in Crossref. (This can happen with bioRxiv.) In these cases, you can add the DOI to EndNote, export the citation in the "Janelia Science News" format, and feed the resulting text file to this script to produce a usable citation. (EndNote won't let you export citations without journal names.) Example: Here's a file called endnote.txt, exported from EndNote:
Bulumulla, C, Walpita, D, Iyer, N, Eddison, M, Patel, R, Alcor, D, Ackerman, D, Beyene, AG. Synaptic specializations at dopamine release sites orchestrate efficient and precise neuromodulatory signaling. bioRxiv. 2024:2024.09.16.613338. http://dx.doi.org/10.1101/2024.09.16.613338.
Run the script like so:
python3 get_citation.py --endnote endnote.txt
To produce:
Bulumulla, C, Walpita, D, Iyer, N, Eddison, M, Patel, R, Alcor, D, Ackerman, D, Beyene, AG. Synaptic specializations at dopamine release sites orchestrate efficient and precise neuromodulatory signaling. https://doi.org/10.1101/2024.09.16.613338.
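The transformation shown above (drop the journal name and the year/pages segment, and normalize the DOI link to https://doi.org/) can be approximated with a short sketch. This is not the logic in get_citation.py; endnote_to_citation is a hypothetical helper, and it assumes the exported line has the shape "Authors. Title. Journal. Year:pages. DOI-URL." with no ". " inside the author list.

```python
import re

# Matches a trailing DOI link such as http://dx.doi.org/10.1101/... or https://doi.org/10.1101/...
DOI_URL = re.compile(r"https?://(?:dx\.)?doi\.org/(\S+?)\.?\s*$")


def endnote_to_citation(line):
    """Convert one exported EndNote line into a newsletter-style citation."""
    match = DOI_URL.search(line)
    if match is None:
        raise ValueError("no DOI URL found at the end of the line")
    doi = match.group(1)
    head = line[:match.start()].strip().rstrip(".")
    parts = head.split(". ")
    if len(parts) < 4:
        raise ValueError("expected authors, title, journal, and year/pages segments")
    authors = parts[0]
    # Everything between the authors and the last two segments is the title.
    title = ". ".join(parts[1:-2])
    return f"{authors}. {title}. https://doi.org/{doi}."


sample = (
    "Bulumulla, C, Walpita, D, Iyer, N, Eddison, M, Patel, R, Alcor, D, Ackerman, D, "
    "Beyene, AG. Synaptic specializations at dopamine release sites orchestrate "
    "efficient and precise neuromodulatory signaling. bioRxiv. "
    "2024:2024.09.16.613338. http://dx.doi.org/10.1101/2024.09.16.613338."
)
print(endnote_to_citation(sample))
```

Splitting on ". " is a simplification: a title that ends a sentence with an abbreviation, or a journal name containing ". ", would need more careful handling.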
In particular, make sure the section on data sources is comprehensive. Also get some installation instructions up on GitHub.