The basic idea is to dump all of the plasmid data into Elasticsearch, then expose that through Slack. If we expose everything (e.g. features, primers, and the DNA sequence), we can let Elasticsearch handle the actual search. To do that, we need to 1) get all of the plasmid data, 2) convert it from .dna into some format we can read (.gbk), 3) convert the .gbk into a JSON document to dump into Elasticsearch, and finally 4) write the Slack interface for our Elasticsearch index.
To handle 1, we need to interface with Quartzy. Quartzy doesn't have an official API, but it does have a pretty decent-looking unofficial one. I highly recommend playing around with your browser's dev tools, but basically, the interface we see is loaded like this:
GET https://io.quartzy.com/groups/190392/items?page=1&limit=20&sort=-name to get the most recent 20 entries; the query params can be changed to fetch more or fewer at a time. Quartzy responds with a pretty helpful JSON data structure. Of particular interest for dumping all of the plasmid data is the meta section, which gives the total number of items and how they are split across pages. You can also iterate as if you were a browser by using the links section.
There is also the helpful field updated_at within each item; if we store this information, we know when plasmids have been updated!
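To make this concrete, here is a minimal sketch of paging through the items endpoint with requests. The URL, group ID, and query params are the ones above; the bearer token and the exact shape of the data and links sections are assumptions, so check them against what you actually see in dev tools.

import os
import requests

BASE = "https://io.quartzy.com"
GROUP_ID = "190392"
# Assumption: auth is a bearer token lifted from your browser session.
HEADERS = {"Authorization": f"Bearer {os.environ['QUARTZY_TOKEN']}"}

def iter_items():
    """Yield every item in the group, following pagination."""
    url = f"{BASE}/groups/{GROUP_ID}/items"
    params = {"page": 1, "limit": 20, "sort": "-name"}
    while url:
        resp = requests.get(url, headers=HEADERS, params=params)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["data"]  # assumption: items live under "data"
        # Assumption: links holds a ready-made URL for the next page,
        # just like a browser would follow; stop when it runs out.
        url = payload.get("links", {}).get("next")
        params = None  # the next-page link already carries its query params

Storing each item's updated_at (keyed by its id) between runs is what lets us skip plasmids that haven't changed.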
Ok great, but to actually get the plasmid files, we need to dump attachments. How do we do this? Well, once you are the shiny owner of a plasmid, grab its item ID (stored in the id key, 38201556 in the example above) and you can GET a description of its attachments. For that plasmid, this means GETting https://io.quartzy.com/items/38201556/attachments, which returns JSON describing each attachment, including a download link.
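Continuing the sketch above: for each item, hit the attachments endpoint and download anything that looks like a plasmid file. The endpoint is the one just mentioned; the field names on each attachment (file name, download URL) are assumptions to verify in dev tools.

PLASMID_EXTS = (".gbk", ".dna")

def download_plasmid_files(item_id, dest_dir="plasmids"):
    """Fetch an item's attachment list and save any plasmid files."""
    os.makedirs(dest_dir, exist_ok=True)
    resp = requests.get(f"{BASE}/items/{item_id}/attachments", headers=HEADERS)
    resp.raise_for_status()
    for att in resp.json()["data"]:  # assumption: attachments under "data"
        # Assumption: each attachment exposes a file name and a download URL.
        name, url = att["filename"], att["url"]
        if not name.lower().endswith(PLASMID_EXTS):
            continue
        blob = requests.get(url, headers=HEADERS)
        blob.raise_for_status()
        with open(os.path.join(dest_dir, name), "wb") as f:
            f.write(blob.content)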
Part 1 summary
Once you deal with authentication (ew :(, not that bad, you just have to eventually get your bearer token), GET https://io.quartzy.com/groups/190392/items, using the query parameters to dump the (JSON) version of the plasmid database. For each plasmid that has been updated since the last check, GET https://io.quartzy.com/items/ITEM_ID/attachments to get a downloadable link. If the attachment is a recognizable plasmid file (.gbk, .dna), download it.
Part 2 summary
At this point, use the SnapGene command line interface to convert the file to an actually readable format, such as .gbk.
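I haven't scripted the SnapGene CLI myself, so as a fallback sketch: the open-source snapgene_reader package parses .dna files straight into Biopython records, which Bio.SeqIO can then write out as GenBank. This swaps SnapGene out entirely, so spot-check that features survive the conversion; the file names here are just illustrative.

from Bio import SeqIO
from snapgene_reader import snapgene_file_to_seqrecord  # pip install snapgene_reader

def dna_to_gbk(dna_path, gbk_path):
    """Convert a SnapGene .dna file to GenBank without SnapGene itself."""
    record = snapgene_file_to_seqrecord(dna_path)
    # Biopython's GenBank writer wants a molecule type annotation.
    record.annotations.setdefault("molecule_type", "DNA")
    SeqIO.write(record, gbk_path, "genbank")

dna_to_gbk("plasmids/pEGFP-N1.dna", "plasmids/pEGFP-N1.gbk")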
Part 3 summary
Either by hand or with something like Biopython to load the .gbk, decide on a searchable dump format and emit a JSON document. It could be something like:
{
  "item_type": "plasmid",
  "sequence": "ATCG...",
  "resistances": {load-me-from-the-quartzy-metadata},
  "features": [
    {
      "type": "CDS",
      "name": "EGFP",
      "sequence": "ATCG",  <- do we even need this?
      "range": [120, 250]
    }
  ]
}
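Here is a minimal sketch of building that document with Biopython. The qualifier holding the feature name varies by file (label, gene, product, ...), and the resistances field would come from the Quartzy item metadata, which is hand-waved here.

from Bio import SeqIO

def gbk_to_doc(gbk_path, quartzy_item):
    """Flatten a GenBank file into the searchable document above."""
    record = SeqIO.read(gbk_path, "genbank")
    return {
        "item_type": "plasmid",
        "sequence": str(record.seq),
        # Assumption: resistances live somewhere in the Quartzy metadata.
        "resistances": quartzy_item.get("resistances"),
        "features": [
            {
                "type": feat.type,
                "name": feat.qualifiers.get("label", [""])[0],
                "range": [int(feat.location.start), int(feat.location.end)],
            }
            for feat in record.features
        ],
    }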
Throw that sucker into Elasticsearch.
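With the official Python client, that is one call; the sketch below assumes a local dev cluster and uses the Quartzy item ID as the document ID, so re-indexing an updated plasmid overwrites the stale version.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local dev cluster

def index_plasmid(item_id, doc):
    """Index the plasmid document; indexing the same ID again overwrites it."""
    es.index(index="plasmids", id=item_id, document=doc)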
Part 4 summary
Ask me for details; I am imagining a Slack interface that gives useful search results. I have never personally used Elasticsearch, but it is the industry standard for stuff like this.
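For flavor, here is a sketch of what that interface could look like using Slack's Bolt for Python and a slash command; the command name, env vars, and query shape are all assumptions, not a spec.

import os
from elasticsearch import Elasticsearch
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

es = Elasticsearch("http://localhost:9200")
app = App(token=os.environ["SLACK_BOT_TOKEN"])

@app.command("/plasmid")  # hypothetical slash command
def search_plasmids(ack, command, respond):
    """Run the user's text through a simple full-text query and reply."""
    ack()
    hits = es.search(
        index="plasmids",
        query={"multi_match": {
            "query": command["text"],
            "fields": ["features.name", "resistances", "sequence"],
        }},
    )["hits"]["hits"]
    lines = [f"{h['_id']} (score {h['_score']:.1f})" for h in hits[:5]]
    respond("\n".join(lines) or "No plasmids found :(")

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()

A real version would format richer results (resistances, feature hits, a link back to Quartzy), but even this is enough to sanity-check the index from Slack.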