bmrb-io / BMRB-API

BMRB API server and client implementations.
GNU General Public License v3.0

NOT and other search operators in queries #44

Open lucubrator opened 6 years ago

lucubrator commented 6 years ago

Is there a way of putting in a NOT operator in a query?

For example, let's say I want to retrieve a list of all entries that do not have any instance of _Entity.Polymer_type = 'polyribonucleotide'.

Is it possible to do something like: search/get_id_by_tag_value/_Entity.Polymer_type/[!]polyribonucleotide

Also, what would be the recommended way of retrieving a list of ALL entries where ALL entities have, let's say, _Entity.Polymer_type = 'polyribonucleotide'? Is there some kind of search operator/switch for that?

If not, how would one get a list of all possible values for the _Entity.Polymer_type tag? I assume you would use the Get tag enumerations (GET) query. I guess one could then take the union of all the sets of entries where _Entity.Polymer_type is what you don't want it to be, and then take the asymmetric difference between (a) this union set and (b) the set of entries which contain at least one entity with 'polyribonucleotide'. Or is there a faster and more efficient way?

I think I have read through the README/docs here, but I might have missed something on http://www.bmrb.wisc.edu. Apologies if that is the case.

jonwedell commented 6 years ago

Is there a way of putting in a NOT operator in a query?

Not at this time.

For example, let's say I want to retrieve a list of all entries that do not have any instance of _Entity.Polymer_type = 'polyribonucleotide'.

Right now the best way to accomplish that is to hit the get_all_values_for_tag endpoint (example) and then apply your own string filter against the results.
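A minimal sketch of that client-side filter follows. It assumes the endpoint returns a JSON object mapping each entry ID to the list of values the tag takes in that entry. The URL pattern, the helper names, and the sample data are illustrative assumptions, not confirmed details of the API. Entries that have no value for the tag might not appear in the response at all, so they would not show up in the result either.

```python
import json
from urllib.request import urlopen

# Assumed URL layout; check the API documentation for the current host and version.
API_URL = "https://api.bmrb.io/v2/search/get_all_values_for_tag/{tag}"


def fetch_values(tag):
    """Download {entry_id: [value, ...]} for one tag (assumed response shape)."""
    with urlopen(API_URL.format(tag=tag)) as response:
        return json.load(response)


def entries_without_value(values_by_entry, unwanted):
    """Return the sorted IDs of entries in which no value equals `unwanted`."""
    return sorted(
        entry_id
        for entry_id, values in values_by_entry.items()
        if unwanted not in values
    )


# Made-up sample standing in for fetch_values("_Entity.Polymer_type").
sample = {
    "100": ["polypeptide(L)"],
    "200": ["polyribonucleotide"],
    "300": ["polypeptide(L)", "polyribonucleotide"],
}
print(entries_without_value(sample, "polyribonucleotide"))  # prints ['100']
```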

If not how would one get a list of all possible values for the _Entity.Polymer_type tag?

The results from the above endpoint will have the full list of values currently present in entries.

Also, what would be the recommended way of retrieving a list of ALL entries where ALL entities have, let's say, _Entity.Polymer_type = 'polyribonucleotide'? Is there some kind of search operator/switch for that?

You can accomplish that with the above query as well. Just check that all the values of '_Entity.Polymer_type' in the list for a given ID are all of type 'polyribonucleotide'.
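As a rough sketch of that check, assuming the response has already been loaded into a dictionary that maps each entry ID to its list of values (the function name and the sample data are made up for illustration):

```python
def entries_with_only(values_by_entry, wanted):
    """Return the sorted IDs of entries whose values all equal `wanted`."""
    return sorted(
        entry_id
        for entry_id, values in values_by_entry.items()
        # Require a non-empty list: all() is vacuously True for an empty one.
        if values and all(value == wanted for value in values)
    )


# Made-up sample in the assumed {entry_id: [value, ...]} shape.
sample = {
    "100": ["polypeptide(L)"],
    "200": ["polyribonucleotide", "polyribonucleotide"],
    "300": ["polypeptide(L)", "polyribonucleotide"],
}
print(entries_with_only(sample, "polyribonucleotide"))  # prints ['200']
```

Entry "300" is excluded because only one of its two entities matches.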

I assume you would use the Get tag enumerations (GET) query.

The enumeration query returns the allowed values we have in the dictionary for that tag. You are correct that for this tag that method would give you all the possible values. For other tags that have non-mandatory enumerations, the tag could potentially have other additional user-specified values, in which case it would be best to get the full list from the get_all_values_for_tag method as described above.
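To see whether a tag carries user-specified values beyond its dictionary enumeration, one could compare the two lists. The sketch below assumes the values have been loaded as a dictionary of entry ID to list of values and that the enumeration is available as a plain list of strings. Both shapes, the function names, and the sample data are assumptions made for illustration.

```python
def distinct_values(values_by_entry):
    """Collect every value that occurs in at least one entry."""
    return {value for values in values_by_entry.values() for value in values}


def values_outside_enumeration(values_by_entry, enumerated):
    """Return the observed values that are not in the dictionary enumeration."""
    return distinct_values(values_by_entry) - set(enumerated)


# Made-up data: observed values per entry and a made-up enumeration list.
observed = {
    "100": ["polypeptide(L)"],
    "200": ["polyribonucleotide", "custom polymer"],
}
enumerated = ["polypeptide(L)", "polyribonucleotide"]
print(sorted(values_outside_enumeration(observed, enumerated)))  # prints ['custom polymer']
```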

I think I have read through the README/docs here, but I might have missed something on http://www.bmrb.wisc.edu. Apologies if that is the case.

Your questions were very well-researched. I hope my response is helpful; feel free to reach out in the future with other questions or suggestions.

Cheers, Jon

lucubrator commented 6 years ago

Many thanks, Jon. This helped to clarify the situation.

I do have a few more questions. I'll state them here, but please tell me if you prefer one topic per opened issue (labelled as a question) in the future.

  1. Do BMRB entries ever get modified? If so, do they get a new BMRB ID, or does the BMRB ID stay the same? Is there a way to get last_modified date/time stamps through the API? Is there a way of getting file size info through the API?

Earlier code relied on that information to sync a certain subset of entries to a local repo, but I have found the API to be far quicker and superior in comparison to the other alternatives.

  2. Can there exist multiple Assemblies in a single Entry? What exactly is a (molecular) assembly?
    • (a) A complex of molecules, (b) a group of interacting molecules, or (c), an assembly of matter onto which measurements are carried out?
    • Are we saying there's one assembly per NMR test tube, or can a set of different experiments be performed on the same molecular assembly as long as the assembly isn't changed/modified? Alternatively, can we even modify the assembly between experiments but still refer to the same assembly ID?
    • Are all contents of an assembly categorized into Entities?

Excuse me if some of these questions are a bit confused. After reading some of the recommended "publications describing the star format" and after browsing the NMR-STAR documentation/schema, I feel that I still haven't fully grasped the NMR-STAR format. I do have an OK understanding of the syntax/structure of a STAR file in general. It is with the introduction of NMR data that I start to have trouble. Do you have any recommended reading?

Anyways, once again, thank you for doing this work, Jon. We are benefitting a lot from the BMRB, and hopefully, now also the API!

Cheers,

Noah

dmaziuk commented 6 years ago

1. Do BMRB entries ever get modified? If so, do they get a new BMRB ID, or does the BMRB ID stay the same?

They can get modified without changing the ID; however, that rarely (if ever) happens to core data. Database links, for example, get modified periodically by BLAST searches.

2. Can there exist multiple Assemblies in a single Entry?

In theory: molecular dynamics is currently modeled with multiple assemblies in the same entry. We don't have any data, though, except for one or two handmade test-case entries.

What exactly is a (molecular) assembly?

  • (a) A complex of molecules, (b) a group of interacting molecules, or (c), an assembly of matter onto which measurements are carried out?

Good question. In the previous version of NMR-STAR, the assembly was called "molecular system", if that helps.

  • Are we saying there's one assembly per NMR test tube, or can a set of different experiments be performed on the same molecular assembly as long as the assembly isn't changed/modified? Alternatively, can we even modify the assembly between experiments but still refer to the same assembly ID?

Multiple experiments on the same assembly are common; the assembly is not modified.

  • Are all contents of an assembly categorized into Entities?

Yes. I'd say it works bottom-up: you have ligands like, say, zinc, and monomers e.g. alanine. Those link into entities, which typically would have a residue sequence. One or more entities form an assembly, e.g. a dimer.

OTOH you could describe a mixture of isomers as an assembly, too.

lucubrator commented 6 years ago

Okay, that clears up the confusion about Entries and Assemblies.

They can get modified without changing the ID; however, that rarely (if ever) happens to core data. Database links, for example, get modified periodically by BLAST searches.

Good to know. I need the DB links, although the ones found through BLAST might not be of the highest importance. In case I need the last-modified info in the end, I assume the best way of accessing it (if it is not accessible through the API) would be to send a request to http://www.bmrb.wisc.edu/ftp/pub/bmrb/entry_lists/nmr-star3.1/?

Side note: I get 500: Internal Server Error seemingly randomly right now. Every other request seems to fail. Both from within Python, with or without headers, and in the web browser, on multiple computers.

jonwedell commented 6 years ago

Apologies for my delay -

I do have a few more questions. I'll state them here, but please tell me if you prefer one topic per opened issue (labelled as a question) in the future.

Either is fine.

Is there a way to get last_modified date/time stamps through the API? Is there a way of getting file size info through the API?

No. For bulk access of the most recent version of all our entries, it is probably best to use the FTP interface to keep an up-to-date local mirror rather than using the API to see which entries have changed.

Running

rsync -av --include='*_3.str' --include='*/' --exclude='*' www.bmrb.wisc.edu::bmrb_entries /tmp/res

will download the NMR-STAR files for all of our entries. Remove the 'include' and 'exclude' arguments to get the full entry directories. Running the command again will synchronize your local directory with any changes that have happened on the server.

Does your workflow require that you be notified when certain entries change, or do you just want to stay up to date? I ask because we don't have a great way to support the former, and that is something I could potentially add to the API if needed.

Also something to note is that if we do make major changes to an entry (author submits corrections after release, for example), we will add a row to the release loop, as seen here. You can query those tags through the API.

Do you have any recommended reading?

We have a publication in process that is an overview of NMR-STAR. I'll update you when it's available.

Side note: I get 500: Internal Server Error seemingly randomly right now. Every other request seems to fail. Both from within Python, with or without headers, and in the web browser, on multiple computers.

Thanks for the report! We were doing some maintenance yesterday that caused this issue. It is now resolved.

Cheers, Jon

*edited to suggest using rsync as per @dmaziuk's suggestion rather than using wget.

dmaziuk commented 6 years ago

I need the DB links, although the ones found through BLAST might not be of the highest importance.

We've been going back and forth on BLAST DB links to some extent: e.g. we used to update the "last queried" tag, but that messes up bulk downloads because it updates every file's timestamp... Anyway, full BLAST results are available in entry_directories/bmrXYZ/bmrXYZ.blast.xml -- you can get it and parse it, and/or you may want to take a closer look at the contents of entry_directories/bmrXYZ and see if you want to exclude some of the files there to save space and bandwidth.

Entry directories are also available via rsync://www.bmrb.wisc.edu/bmrb_entries

lucubrator commented 6 years ago

Thanks again for your replies.

Does your workflow require that you be notified when certain entries change, or do you just want to stay up to date? I ask because we don't have a great way to support the former, and that is something I could potentially add to the API if needed.

The latter is fine in my case; there is no need to be updated on each modification. I only need to re-run and update every week or so, and I only need at most a few hundred entries each time (but often far fewer than that, maybe 2-10 entries). It is important that the DB links are up to date, however. If a previously downloaded entry has been updated, I would want to fetch it again on the next update.

Entry directories are also available via rsync://www.bmrb.wisc.edu/bmrb_entries

Initially I wanted to go the rsync route, but I am also working with something which is supposed to be as platform independent as possible. Trying to make rsync portable, lightweight and platform independent wasn't something I wanted to jump into. As of right now, an "inventory list" with file size and last mod information is created from a request to http://www.bmrb.wisc.edu/ftp/pub/bmrb/entry_lists/nmr-star3.1/, while the rest is done through the API (and locally).
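The update step described above could be sketched as a comparison of two inventories. This is only an illustration under stated assumptions: each inventory is taken to be a dictionary mapping an entry ID to a (file size, last-modified string) pair already parsed from the directory listing, and the function name and sample values are hypothetical. Parsing the listing itself is not shown.

```python
def entries_to_refresh(previous, current, wanted):
    """Return the wanted entry IDs that are new or changed since the last run.

    `previous` and `current` map entry ID -> (size_in_bytes, last_modified).
    """
    stale = []
    for entry_id in wanted:
        if entry_id not in current:
            continue  # not listed on the server, so there is nothing to fetch
        if previous.get(entry_id) != current[entry_id]:
            stale.append(entry_id)  # never downloaded, or size/date differs
    return stale


# Made-up inventories from two consecutive weekly runs.
last_week = {"100": (5120, "2018-01-02 10:00"), "200": (7300, "2018-01-02 10:00")}
this_week = {
    "100": (5120, "2018-01-02 10:00"),
    "200": (7410, "2018-01-09 08:30"),
    "300": (2048, "2018-01-08 12:00"),
}
print(entries_to_refresh(last_week, this_week, ["100", "200", "300", "400"]))
# prints ['200', '300']
```

Only the entries returned here would then be fetched again through the API.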

lucubrator commented 3 years ago

Hi again, @dmaziuk and @jonwedell.

Another question has popped up in our lab. Namely, does the BMRB keep a log of which entries are added, removed and modified?

We're specifically looking for that information with regards to the time period 2017-present.

If not, do you know personally whether many RNA entries have been added, deleted, or updated in that period?

jonwedell commented 3 years ago

Hey @lucubrator -

For our macromolecule entries, we do track when entries are released, withdrawn, and updated with new information from the author in an entry-tracking database.

In addition, minor updates, usually typo fixes or small internal changes, can also happen without a record in the release database. Except when trivial, those changes are tracked in the _Release loop present in each entry, which is publicly available. (And to go even further, we track every single change that ever happens to our entries using version control software, though we don't currently offer a way for users to access anything but the most recent version of an entry.)

I can say that very few entries are withdrawn (our term for what you call "deleted"), and major updates after release are relatively rare. You can see which entries have been withdrawn here: https://bmrb.io/data_library/withdrawn.shtml

It is possible to calculate specific figures if that would further your research. If so, please reach out to us at help@bmrb.io with the specific data you want, and one of us can write the appropriate query to provide the data you are looking for.