
What happened to V3 + Preview JSON endpoints? #704

Closed barisbalic closed 9 years ago

barisbalic commented 9 years ago

Hi, I was wondering what happened to the JSON endpoints that were available until around 9 days ago. Are they coming back? If not, will you be introducing something similar again?

yishaigalatzer commented 9 years ago

Can you be a bit more specific? What are you looking for exactly?

barisbalic commented 9 years ago

Sorry, I wrote this late and wasn't very clear! I was previously indexing all packages by hitting a few of the endpoints that were exposed; I've listed them below. At some point ~10 days ago they disappeared. I realise that they probably weren't for public consumption, but they were very useful. I was wondering if you're bringing anything like these back.

- http://preview.nuget.org/ver3-ctp1/islatest/segment_index.json to get the list of segments
- http://nugetprod0.blob.core.windows.net/ver3-ctp1/islatest/segment_#{number}.json to get the segment content

yishaigalatzer commented 9 years ago

Will get back to you with guidance. These were not meant for public consumption.

barisbalic commented 9 years ago

I can achieve something similar by hitting the old V2 endpoint repeatedly to get through the whole set, though I'm not sure how long that will be around. I appreciate your time, thanks very much.

johnataylor commented 9 years ago

In order to index all the package data you should use a structure we have introduced, which we call the "Catalog."

However, before we go into any detail, you should note that our team's primary focus over the last few months has been on rebuilding much of our client stack. And, in that process, we have been careful not to depend directly on this structure within those new client bits. Rather, what I'm describing here is a long-term replacement for the other-than-Visual-Studio uses people make of our REST APIs. For example, people like you, who want to replicate all our metadata and index it for some other purpose. This API is therefore new and, as the phrase goes, subject to change. But please take a look and give us your feedback.

The Catalog is essentially an append-only database of everything that comes into NuGet. It's "append only" precisely because this makes external indexing and replication extremely easy and efficient. Much more so (for you and for us) than the V2 OData feed. So hopefully this meets your requirements.

The data is stored as a tree of resources. If you were to draw this on paper with the root at the top of the page, you might think of the most recent data being added as a leaf on the bottom right.

The root of the catalog is currently here:

https://api.nuget.org/v3/catalog0/index.json

This endpoint will, itself, be discoverable via a root index file. We haven't added it yet but will soon. The root index is what you can rely on long term, and you'll find it here:

https://api.nuget.org/v3/index.json

A short digression, because it's important to understand how this file works. When we add entries into this root index we also describe them. The idea is we will add that Catalog endpoint in here (for today you can just hit it directly). Currently the description consists of a number of "@type" values; this is designed explicitly to be machine-read. Simply find an endpoint, indicated by the "@id", that matches your needs and you're set. There might be multiple endpoints. There might be multiple types on an endpoint. Yishai has suggested we also add human-readable labels in here too; that is a great idea and we'll hopefully get those added in the next couple of days.

If you are interested in this sort of thing, you'll find that we are following a number of industry standards here: the conceptual model for the data is RDF, the serialization format is JSON-LD (which is itself based on JSON), and the vocabulary for human-readable labels will also follow the appropriate standards, such as RDFS and perhaps SKOS. You can read about these standards if you like, but that is not a requirement to get going: you can treat this just like any other well-formed JSON file. (These RDF technologies are perhaps a little unfamiliar to the average Microsoft platform hacker, but it's interesting to note that they are a big deal with some of the Open Government Data initiatives, and those guys have some fairly challenging data scenarios, so we think we are on solid ground at least.)
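To make the "find an endpoint by @type" step concrete, here is a minimal C# sketch. It assumes the index exposes a "resources" array of objects carrying "@id" and "@type" (with "@type" possibly a single string or an array, since an endpoint can have multiple types); as the API is still in preview, verify those names against the live index.json before relying on them.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class ServiceIndex
{
    // Returns true if the resource advertises the requested type,
    // whether "@type" is a single string or an array of strings.
    static bool HasType(JsonElement resource, string type)
    {
        var t = resource.GetProperty("@type");
        if (t.ValueKind == JsonValueKind.Array)
        {
            foreach (var item in t.EnumerateArray())
                if (item.GetString() == type) return true;
            return false;
        }
        return t.GetString() == type;
    }

    // Fetch the root index and pick the "@id" of the first matching resource.
    public static async Task<string?> FindEndpointAsync(string indexUrl, string type)
    {
        using var http = new HttpClient();
        using var doc = JsonDocument.Parse(await http.GetStringAsync(indexUrl));
        foreach (var resource in doc.RootElement.GetProperty("resources").EnumerateArray())
            if (HasType(resource, type))
                return resource.GetProperty("@id").GetString();
        return null; // no endpoint advertises this type
    }
}
```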

Back to the "Catalog." If you browse to that file (I use Chrome with a JSON plugin) you can walk through the tree of package metadata. The tree is currently three levels deep, and the actual package metadata is in the leaf nodes. The links have associated commitTimestamps. If you remember the last commitTimestamp you read, then you can pick up from where you last left off. You hold onto that on your side. This is essentially a "durable cursor" model. Basically the idea is we can have many agents reading this data, and doing so at next to no cost to ourselves. You see, for us, the catalog is just static storage.
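As a sketch of that durable-cursor walk, the following C# fetches the catalog index, follows only the pages whose commitTimestamp is newer than the saved cursor, and returns the new cursor to persist. The property names ("items", "@id", "commitTimestamp") follow the description in this thread; check the live JSON for the exact names and casing.

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class CatalogWalker
{
    static readonly HttpClient Http = new();

    public static async Task<DateTimeOffset> WalkAsync(
        string catalogIndexUrl, DateTimeOffset cursor, Action<JsonElement> onEntry)
    {
        var newCursor = cursor;
        using var index = JsonDocument.Parse(await Http.GetStringAsync(catalogIndexUrl));
        foreach (var page in index.RootElement.GetProperty("items").EnumerateArray())
        {
            // The page's commitTimestamp tells us whether the link is worth following at all.
            if (page.GetProperty("commitTimestamp").GetDateTimeOffset() <= cursor)
                continue;

            using var pageDoc = JsonDocument.Parse(
                await Http.GetStringAsync(page.GetProperty("@id").GetString()!));
            foreach (var entry in pageDoc.RootElement.GetProperty("items").EnumerateArray())
            {
                var ts = entry.GetProperty("commitTimestamp").GetDateTimeOffset();
                if (ts <= cursor) continue;   // already handled on a previous run
                onEntry(entry);               // full metadata lives at the entry's own "@id"
                if (ts > newCursor) newCursor = ts;
            }
        }
        return newCursor; // persist this; the next run resumes from here
    }
}
```

Note that each entry handed to onEntry is a page item; to get the complete package metadata you follow that item's own "@id" to the leaf document.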

But remember this is an append-only model. So you will see all the adds, edits, and ultimately deletes appended to that tree. It's your responsibility to apply those changes to your index. If you are interested in this sort of thing, you might notice that we have played around a little with the typical database model: rather than exposing the "tables" of a database as the REST API, as is the case in a typical OData deployment, we have instead exposed the "transaction log." We did this because it's easier for you to build whatever database or index you want from a transaction log than it is from a database: you simply have to apply the changes in order.

The code to read the catalog is quite straightforward. There is a class you'll find in the NuGet.Services.Metadata project. If I remember right, it's called Collector. It implements a simple push model (perhaps we will get around to having a pull model at some point). And I think I had some JavaScript node.js code kicking around too (such is the advantage of doing everything in JSON).

One of the primary things we intend to do with the Catalog is use it as the source of data for our Lucene index. In this case we actually store the cursor to the Catalog inside Lucene's commit user data. It's important that the cursor is saved in the same transaction as the one that updates the Lucene index. It's also important that we check for duplicates in the index as we make that commit. We have yet to fully deploy all this, but as you might imagine it represents a significant simplification over what we have to keep running today.
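Since the Lucene wiring isn't deployed yet, here is a deliberately simplified C# sketch of the underlying idea only, with a plain JSON file standing in for the real index: the state and the cursor land in storage in one atomic step, so a crash can never leave the cursor out of sync with the data it describes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

static class CheckpointedStore
{
    // Commit the data and the catalog cursor together, atomically. The
    // principle, not the storage engine, is the point of this sketch.
    public static void Commit(string path, IDictionary<string, string> state, DateTimeOffset cursor)
    {
        var tmp = path + ".tmp";
        File.WriteAllText(tmp, JsonSerializer.Serialize(new { cursor, state }));
        // Renaming over the old file is atomic on a single volume, so a reader
        // (or a crash) never observes a cursor that disagrees with the state.
        File.Move(tmp, path, overwrite: true);
    }
}
```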

As a final note, perhaps you would be interested to observe that, as far as it makes good sense, our "internal" processes need not be any different from those we expose to the "external" world. Just as you are expected to crawl a web of data (arranged helpfully as a tree), our internal processes do the same.

Hopefully this makes sense. And please let us know how you get on.

yishaigalatzer commented 9 years ago

@barisbalic I'm going to close this issue, as I don't think there is a "bug" to follow up on. We can keep conversing here if you have more questions.

barisbalic commented 9 years ago

@johnataylor @yishaigalatzer, thanks both very much for your time and your very detailed responses. This definitely isn't a bug, but I've gotten into the habit of communicating with people over Issues in other projects, so I didn't think twice about it here; sorry if it's the wrong format.

@johnataylor what you're saying makes a lot of sense. I'm very familiar with transaction logs/WALs etc., and I've dealt with RDF and JSON-LD before. Although I prefer something more akin to json-api, these are both a big step in the right direction compared to the old OData interface.

I only have a few remaining questions:

1) All the packages seem to have a type of 'PackageDetails', which doesn't tell me whether they are creates/updates/deletes; your answer just now implies this type field should?
2) It's unclear to me whether the commitTimestamp that acts as the cursor should be consumed at the catalog root level or somewhere below. That is to say, if page 935 exists in the root, is its commitTimestamp fixed? Will no other packages be added to that page? Or do I have to check the commitTimestamp of all packages within a given page?

Thanks again for being so responsive!

johnataylor commented 9 years ago

1) Updates are simply the whole thing again with the updated values in place. Deletes will appear as type PackageDelete in the Catalog. Currently we haven't wired deletes into this particular catalog.
2) You have to check the value all the way down. The idea is the value tells you whether you have to follow the link (the commitTimestamp describes the resource, where the resource is a page of the catalog or a leaf node with data).
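Putting those two answers together, a replay loop might look like the following C# sketch: entries are applied in commit order, PackageDetails entries (creates and edits alike) overwrite, and PackageDelete entries remove. The type strings and the promoted "nuget:id"/"nuget:version" property names are taken from this thread and may be namespace-prefixed in the live feed, so verify them before relying on this.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

class CatalogReplayer
{
    // Local store keyed by (package id, version).
    readonly Dictionary<(string Id, string Version), JsonElement> _store = new();

    // Apply one catalog entry; call this in commitTimestamp order.
    public void Apply(JsonElement entry)
    {
        var type = entry.GetProperty("@type").GetString();
        var key = (entry.GetProperty("nuget:id").GetString()!,
                   entry.GetProperty("nuget:version").GetString()!);

        if (type != null && type.EndsWith("PackageDelete"))
            _store.Remove(key);          // a delete tombstone: drop our copy
        else
            _store[key] = entry.Clone(); // PackageDetails: create or edit, same operation
    }
}
```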

Couple of things:

I found the old Node.js code I referred to: https://github.com/NuGet/NuGet.Services.Metadata/blob/master/tests/CatalogTests/Node/collect.js But be warned, I haven't tested this in a while and perhaps it's out of date; anyhow, it's pretty simple.

Otherwise please take a look at: https://github.com/NuGet/NuGet.Services.Metadata/blob/master/src/Catalog/CommitCollector.cs Apologies in advance for my use of LINQ here; in my defense, it's not an unreasonable use :-) and you shouldn't be too concerned, as it's only across a "page" which you would have already loaded in memory anyhow.

Thanks for the pointer to json-api. Just to be clear about our agenda: what we were looking for was to build a "web of data" as a way to leverage the modern cloud computing platform as directly as possible. There is no compute sitting behind this JSON. Just raw cloud storage (we use Azure storage) and sometimes content delivery networks. Our "api" is simply the arrangement of our resources, which in our case are static and maintained through some background jobs. As you can imagine, when we go live this is significantly simpler and more easily made reliable than running an active web site for the same thing. In fact we end up putting all our focus on Lucene.

One of the things we have played around with is "promoting" properties from resources to the linking parent resource, to add to the description there. The commitTimestamp is an example of this, and that data is actually repeated in the resource itself. Another example is nuget:id and nuget:version. We promote these to the Catalog page, and this has some key advantages: for example, if we were to re-index the whole thing we would want to partition the indexing work by id, so having those ids repeated on the page makes that easy.
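As a tiny illustration of why that promotion matters, this C# sketch partitions the entries of one catalog page by package id without touching a single leaf document (the "nuget:id" property name is as described above and should be verified against the live feed):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

static class CatalogPartitioner
{
    // pageEntries stands for the "items" array of one catalog page, already
    // loaded in memory; grouping by the promoted id needs no further fetches.
    public static IEnumerable<IGrouping<string, JsonElement>> ById(
        IEnumerable<JsonElement> pageEntries)
        => pageEntries.GroupBy(e => e.GetProperty("nuget:id").GetString()!);
}
```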

Thanks for your feedback, and do let us know how you get on. I think we are going to close this issue, as we are using these GitHub issues to track our active work items (of which there are many!). But don't let that stop you giving feedback.

barisbalic commented 9 years ago

Okay, sorry again for using issues. If I have a follow-up, how should I pass it along / contact you?

johnataylor commented 9 years ago

No worries about using issues. It's just that we have recently switched to a mode of being pretty active on addressing issues, and that naturally means closing them once they are beyond the point of being actionable.

So please don't take closing as a negative thing, that's all!

And by all means open another issue. Or reopen this if that's meaningful.

barisbalic commented 9 years ago

@johnataylor sorry for the long delay, things have been busy!

With all due respect, I quite dislike the fact that the kind of action/operation is not explicitly stored; I feel it's the responsibility of the transaction log (the source of truth) to make that obvious, without need for interpretation.

Let's say, for example, that I come across an entry for 'NUnit' which I already have stored somewhere, but the version differs. How do I know whether this is an update to the existing document or a new version?

It might be that some other rules that exist in your process, or the package life-cycle, give enough context to solve this problem, but I didn't come across them, and the beauty of the log is that you can remove any question by being explicit.

I appreciate you may disagree, or may just no longer be able to make those kinds of changes to the API, so discussion aside, my question: is there any way that an existing entry that differs only by version could be the result of updating some metadata rather than publishing a new package?

Cheers again!