Need way to mirror definitions

jeffmcaffer commented 5 years ago

There is interest in people replicating definition data locally to support robustness/performance, infrastructure control, and privacy. Principles: * Readonly -- clearlydefined.io is still the source of truth with all curations, harvesting, ... done in the production service. * Definitions only -- Harvested data and curations are not included in the mirroring process. * Only point queries -- The production service supports arbitrary queries over definition documents. The local copy only needs point queries based on component coordinates. Options: * rsync-style -- The definitions are just blobs so in theory we could mirror those as files and allow people to read from disk. That exposes the user to internal details of ClearlyDefined. * slave service -- Implement a path through the service code that is read-only and has the mirroring activity built in. This would shut down any write paths, not have a crawler, ... and implement whatever mirroring protocol we decide is best. Random thoughts/topics - [ ] Must all definitions be aggressively computed? Currently we (re)compute definitions on demand in the event of schema changes. We could have the local service fall back to the remote service if the schemas don't match. - [ ] First replication is different. That could be a bulk download of a dump where as keeping up to date continuously replicates recent actions. - [ ] periodic or continuous. Need to determine if the use cases require up to the minute replication or if periodic (hourly, daily) replication is enough. - [ ] should be related to the need for an "event stream" that enables people to track new definitions. - [ ] Local scenarios may use different data store technology from the main service. A simple version would just put the data in the local file system. So this is not a straight record for record mirror. Rather it should use an API to read and write the data using the correct structures. - [ ] Local servers, being read-only, need not ever compute a definition. cc: @jeffmendoza

geneh commented 4 years ago

@jeffmcaffer Today the /definitions/:type/:provider/:namespace/:name/:revision endpoint checks whether a definition is cached. If it isn't cached, the definition is retrieved from a pre-computed definition blob and then cached. If the definition doesn't exist, it is computed based on the harvested data. If there is no harvested data for the definition, the crawler is requested to harvest the corresponding data.

If the harvested data and the curations are out of scope and the definition cannot be found in the definitions replicated store, the crawler won't be notified about the missing definition and it won't be queued up. Is this understanding correct? Would in-memory caching work here or should this type of caching be disabled due to a potential risk of running out of memory?

Regarding the definitions syncing, the following options are possible starting from the least amount of effort. I am also open to any other options and ideas.

The internally-hosted service would access the production definition blob storage in Azure by using a read-only+disabled blob listing SAS token with an expiration date.
The internally-hosted service would access a replicated Azure blob storage using an anonymous SAS token with the same permissions as described above. The blobs would be replicated daily to a different container by an Azure Data Factory pipeline.
Azure Data Factory can also push blobs incrementally to a remote file store using credentials. It would be hard to scale this solution to additional users and become too expensive, so I would advise against it.
The internally-hosted service would access a local file store. If the definition isn't available or the schema version has changed, it will retrieve the data from the hosted service and persist it locally. A hybrid solution may also be copying all the blobs initially and maybe periodically. Pros: the data is mostly up-to-date and the crawler is notified about any missing components. Cons: the service may still be dependent on the hosted service. The definitions may become out of sync/stale for recomputed definitions.
A script can be developed to copy all or some of the blobs based on the blobs' prefix. The script could potentially pull only the blobs that have changed since the last successful run. Azure blob storage does not have an API to retrieve blobs starting from a specific time, so all the blobs would need to be listed using pagination. Pros: the service will be completely self-contained. Cons: the script will have to be scheduled to run daily and can potentially fail due to the high volume of data.

Additionally, Azure Data Factory provides blobs compressing capability. It looks like this capability is only for every blob individually, so it won't be possible to compress all or incremental blobs in one archive. I haven't tested this option but it seems to be the case based on the official documentation.

geneh commented 4 years ago

@bduranc @iamwillbar @pombredanne Could you please review and comment here?

pombredanne commented 4 years ago

My preferred option would be to use rsync as this would eschew a whole lot of complexity. I do not think that exposing the data files/blob is an issue.

geneh commented 4 years ago

@pombredanne We can set up Azure Data Factory pipeline to incrementally copy blobs to Azure file share. The files should be then accessible via Server Message Block (SMB) protocol that works with rsync. Azure file shares are more expensive than Azure blob storage but still relatively cheap. @jeffmcaffer @iamwillbar Please let me know if this option should be pursued.

bduranc commented 4 years ago

Hi Everyone, Sorry for being late to the party...

@jeffmcaffer

Principles: - Definitions only -- Harvested data and curations are not included in the mirroring process.

I am a bit confused about this statement. I thought a definition consists of both harvested and curated data? Is this intended to mean that individual curations and harvest (tool) data would not be available in the mirror, but the computed definition itself (with its baseline harvest data and any applied curations) would still be?

@geneh : I think it may be a good idea to study a few more use-cases where folks may want to consume CD data and processes internally so we have a clear idea about what option would work best. I propose we create a Google Doc and call out to the community in order to obtain specific use-cases. I think this would also help obtain a better understanding of what business needs would require the interval of the data to be replicated/refreshed "on-demand" vs. "periodic" (hourly, daily, etc.)

For one of my team's own potential use-cases, @jeffshi88 did provide a bit of detail in #650. What's proposed in #650 I think is a bit different as it focuses more on mirroring part of the actual CD process in one's own internal environment while still being able to contribute back to the public. The main reasons for this would be to have priority of harvesting resources for one's own workloads and tighter integration with their own internal systems. But I feel there is a general interest to having offline, read-only access to the data as well.

I should be transparent and say my colleagues and I are still determining if this is something our project requires and exactly to what extent... The existing REST APIs do appear to support most common operations of the CD process (harvesting, curation, definition retrieval, etc.) but might become problematic when working with very large volumes of data (when contrasted to the "rsync" approach described here, which as I understand gives people access to the actual definition blobs). Also, if an "rsync" approach does indeed "expose the user to internal details of ClearlyDefined", or otherwise provide more verbose data than would normally be available through the REST APIs, with proper documentation this could be beneficial.

jeffmcaffer commented 4 years ago

@geneh sorry for the delay in responding. The proposal here is to mirror only definitions. That is all the locally running replica service would access. No harvested data, no curations, no recomputing definitions, ... It is a dumb, read-only replica of the clearlydefined.io service.

The replica service both responses to a limited set of API calls and is responsible for mirroring the definitions "locally" (i.e., into the control of the party running the replica). Given the volume of data, the locally replicated data should be locally persistent. If the replica service crashes or needs to be redeployed/updated, the data should not have to be re-mirrored.

It is highly likely that different replica hosts will want to persist their mirrored copy differently. Amazon folks for example will reasonably want to put the data in S3 or some such. Whatever folks are using to support their engineering system. So we cannot assume that they have it, for example, on a local file system. Replica hosts should be able to plug in different definition storage providers.

Unfortunately this implies that the mirroring system needs internally to call the storage provider APIs and not be concerned about how the data is stored.

I'm not sure if/how rsync can help in that context as IIRC it assumes a filesystem copy of the millions of definition files and then syncs these copy with the master. Assuming that the replica has it's copy of the definitions available in an rsync compatible filesystem is likely too restrictive.

jeffmcaffer commented 4 years ago

@bduranc great feedback. For this issue we are looking to enable "at scale" users of ClearlyDefined to operate with confidence, independence and privacy. These are concerns we've heard from several folks (including my own team at GitHub). clearlydefined.io has best effort SLAs. While many of the folks running the service are responsible for their own company's engineering system, they are not responsible for yours 😄 Folks reasonably want better control over systems that can, in theory, break their builds and disrupt their engineering systems.

In that scenario, only the definitions need be mirrored. As you observed, the definitions are a synthesis of the harvested data and optional curations. Since that processing has already been done and the resultant definitions mirrored, the harvested data and curations need not be mirrored.

Your points about tool chain integration and curation contribution are great (thanks for keeping that in mind!). Given the range of tools and workflows, I suggest that we keep that separate from this discussion and see how it evolves. make sense?

geneh commented 4 years ago

Thanks, @jeffmcaffer! In this case I think the following should be the best option both for the users concerned about their confidence, independence, and privacy as well as the ClearlyDefined project interested in harvesting as much data as possible:

Come up with a new way through configuration to indicate whether the service should run in a replica service mode and allow only a limited set of web APIs as described above.
Enable a new replica definitions store. Initially, the new capability may be tested with either a local file store or a new blob storage container. The algorithm for a definition retrieval should be as following: a. Get a definition from the replica store. b. If the definition exists in the replica store, respond with the definition. Call api.clearlydefined.io to retrieve a possibly updated definition and store it in the replica store. This operation should be non-blocking and shouldn't fail if, for example, api.clearlydefined.io service is down. c. If the definition doesn't exist, call api.clearlydefined.io to retrieve and possibly trigger harvesting of the missing definition, respond, and store the data in the replica store. If, for example, api.clearlydefined.io service is down, the web API calls should not fail.
The community should contribute code for a new store provider, for example Amazon S3. The users of this feature can optionally be provided with a read-only + list blobs SAS token with an expiration date for a one time definition blobs copying if needed. The copying script is also out of scope and should be contributed by the users.

Please let me know if it makes sense.

jeffmendoza commented 4 years ago

Step 2 should just be:

a. Get a definition from the replica store. b. If the definition exists in the replica store, respond with the definition.

Everything else is lower priority (maybe do in the next iteration), and should be behind a feature flag when implemented. Those who care about security/privacy will not want an automated call to api.clearlydefined.io based on their internal usage.

For those who are mirroring for reliability reasons only could use this proxy/caching behavior, but we might be better served working on a more robust copying tool and scheme (up for discussion).

jeffmcaffer commented 4 years ago

Most of the mirroring scenarios I've seen where some combination of:

privacy -- teams not willing to surface requests related to the open source they are using
control -- teams not wanting to have a third party service as part of their critical engineering system infra
scale -- teams wanting to party on the service as hard as they like and not worry about taking down the public service or getting rate limited
disconnected -- machines not connected to the internet (so can't call clearlydefined.io) on demand but can access a more private data store

I light of these requirements, I think we need a solution that includes

read-only mode service as suggested in 1 above
local/user-controlled storage. See 3 above
some means of mirroring (push or pull) data from clearlydefined.io to the "local" service.

It is that last point that has the most questions around it. Would love to talk about options there. On-demand fallback to clearlydefined.io is an interesting option to have but goes against several of the requirements.

eShuttleworth commented 4 years ago

I'd like to use CD to identify open source components in closed source firmware by cross-referencing hashes from extracted files with those from CD. Making API requests isn't a realistic option because we have several million hashes and would rather avoid bottlenecking on web requests. The rsync-like option seems optimal for my use case, as I'll almost certainly need to transform the data again, and I have no problem having to deal with CD internals where necessary.

I know that AWS allows for requester pays buckets, this seems like a good application thereof assuming that Azure has a similar option.

jeffmcaffer commented 4 years ago

thanks for the detail @eShuttleworth. A quick clarification on your use case to ensure that ClearlyDefined is applicable here... Are these extracted files source? binaries taken from packages? Binaries that you (someone not the package publisher) built? ClearlyDefined will only have data related to files found in the canonical packages or corresponding source repos.

eShuttleworth commented 4 years ago

These are files that have been extracted from images of *nix devices, usually containing a mix of binaries compiled from source and from packages. I don't expect to be able to use ClearlyDefined to get too much information about these binaries, but I am hoping that it will help identify packages from interpreted/JIT languages like Python, JS, and Ruby.

pombredanne commented 4 years ago

I am making this a Google Summer of Code idea

zacchiro commented 4 years ago

Together with @jeffmendoza and @pombredanne, we have drafted a GSoC topic in the context of Software Heritage for mirroring and integrating ClearlyDefined data into the Software Heritage archive (and also worok on integration aspects in the converse direction, which are not relevant for this issue so I'm leaving them aside for now).

Completing such task will not address this issue in its full generality, but hopefully it will provide a useful return on experience on a first mirroring attempt of ClearlyDefined data.

romanlab commented 4 years ago

My team and I recently integrated with CD to pull packages metadata, focusing on licencing. Our use case is quite simple but we faced challenges that a mirroring solution can greatly help with.

One thing that wasn't mentioned earlier but can be a potential use-case(it is for us) is owning the data(definitions) to run more complex, business specific queries on the data. For this use-case, replicating the service won't work. I see 2 scenarios for this:

The data needs to be combined with other business specific data to run queries on all the info. e.g. package info combined with a customer's manifest data(package.json).
The queries need to be complex(e.g. mongo's Aggregation Pipeline complexity) but are not generic enough to be implemented as part of the service.

Other challenges were already mentioned but I'll add them here for completeness.

Initial replication - We ran a script that pulled all the relevant data, respecting the API rate limit. We populated our DB(postgres) with the results.
Ongoing updates - We have a daily process that tries to refresh everything that's relevant for our use case(~40K packages), respecting the API rate limit(takes approx. 8 hours)
New data - If we encounter a package we don't have any data for we query the API directly to fetch everything. This adds the package to the daily update process to keep it up-to-date.

We have a solution for these challenges but it's flaky and will require further work and tweaking for stability, robustness and to avoid rate limit. It's also not very scalable as the number of packages relevant for us grows.

romanlab commented 4 years ago

@jeffmcaffer

A few examples of use-cases where data replication will be more beneficial than service replication:

Cross referencing internal data with ClearlyDefined data
- Getting all the licenses for specific versions of a list of packages
- Joining license information with file content - for NPM, how many MIT packages have also an npm test command in their package.json
Researching trends - MIT trend in NPM
Packages with License X have also file Y - How many MIT packages have also a Readme file.

These are just a few examples but anything that requires consuming/processing a large dataset will be easier with access to the data instead of a web service.

pombredanne commented 4 years ago

FWIW, I have developed a reasonably robust tool that can sync and keep syncing the whole ClearlyDefined dataset and I will release it tomorrow to present in the weekly call.

pombredanne commented 4 years ago

FWIW, I have developed a reasonably robust tool that can sync and keep syncing the whole ClearlyDefined dataset and I will release it tomorrow to present in the weekly call.

@zacchiro @romanlab @kpfleming @jefshi88 @bduranc @fossygirl and all: I published the ClearCode toolkit repo there https://github.com/nexB/clearcode-toolkit The slides used for the presentation are in there https://github.com/nexB/clearcode-toolkit/tree/32f310669603d17c9adc594104694db0a3f0a878/docs The bulk of the contributions are from @JonoYang and @majurg and all the credits for this are for them. :bow:

@romanlab I added a new option to fetch only the definitions in https://github.com/nexB/clearcode-toolkit/tree/only-def-and-misc based on your request during the meeting :)

@jeffmcaffer you reached out to get the details since you could no join yesterday, here they are ^

All: at this stage my suggested and preferred course of action would be to adopt this tool as the solution for this ticket. It works, it is tested and is available now.

Any objection? if that is so please voice them here.

Separately it would be greatly helpful to work to fix some of the issues I faced such as:

Failed to fetch harvest #688
Failed to fetch data through the API #675
Inconsistent number of Cocoapods definitions in the stats and fetchable through the API #670

And in a lesser way:

Failure to search for numeric namespace #674
Also sort by last updated date #669
Sort by newer doesn't sort by newer release date #465

jeffmcaffer commented 4 years ago

Thanks for this @pombredanne. Interesting. I took a quick look at the slides and have some thoughts (below). Happy to follow up in a call with interested parties. My apologies for missing the call where I suspect at least some of this was discussed.

Seeding a new mirror (step 1) would likely be better as some sort of bulk operation leveraging something in Azure blob. I don't recall the details abut 10s of millions of individual file downloads will be painful
I'd like to understand step 3 in practice. this has been a main point of design variation. Is it push? pull? is there a cursor? or is it based on arbitrary data? how is it different from step 1?
is the intention that this be part of ClearlyDefined or a separate tool?
- If yes, i'd very very much like to not introduce more tech stack. That just limits the people who can work on the system and means that code and structures cannot be shared and when we change something we have to work in two very different tech stacks.
- if no, how would you keep the mirroring tech in sync with changes in, for example, the data structures in ClearlyDefined?
Is there a concrete scenario for step 5? I'd be concerned about doubly mirrored data being non-authoritative and lagging. Could/should such mirrors just run against ClearlyDefined itself?
What is the relationship between the API for #4 and that of ClearlyDefined? I'm concerned about bifurcating the space and ending up with integrations (e.g., ORT, GitHub, ...) that work against one API and not the other or forcing integrators (and potentially users) to do extra work.
Do you see concrete use cases for mirroring the harvested data? Most scenarios we've encountered really just need the definitions.

Would love to know more about the Cloudflare issues. That feels like a separate topic. If Cloudflare is causing problems, they should be fixed or Cloudflare eliminated. People should not *have to set up a mirror to get reliability.

pombredanne commented 4 years ago

Hi @jeffmcaffer than you for your reply!

You wrote:

Seeding a new mirror (step 1) would likely be better as some sort of bulk operation leveraging something in Azure blob. I don't recall the details abut 10s of millions of individual file downloads will be painful

That's a one time operation so that's not a big concern now that I am over that hump. I am working out something so we make a seed DB available for public consumption, making this a non-event.

I'd like to understand step 3 in practice. this has been a main point of design variation. Is it push? pull? is there a cursor? or is it based on arbitrary data? how is it different from step 1?

I am assuming you mean step 3 and step 1 in the presentation at https://github.com/nexB/clearcode-toolkit/tree/32f310669603d17c9adc594104694db0a3f0a878/docs Here is a copy for reference: index

Step 1 and 3 are the same: items are fetched (so I guess pulled) from the ClearlyDefined API and stored in the filesystem and/or in a database. There is also a command line utility to import a filesystem layout to a DB layout since we started with files until that proved impractical, and then switched to using a DB.

is the intention that this be part of ClearlyDefined or a separate tool?

It makes sense to me and I will maintain that tool whether this happens or not.

If yes, i'd very very much like to not introduce more tech stack. That just limits the people who can work on the system and means that code and structures cannot be shared and when we change something we have to work in two very different tech stacks.

This is a tool that you can checkout and runs as-is with minimal configuration, so IMHO that's not an issue, especially since Python is already part of the node.js stack and required for node-gyp.

if no, how would you keep the mirroring tech in sync with changes in, for example, the data structures in ClearlyDefined?

It uses the API, so it will have to adapt to any API changes. I hope that such changes are uncommon and small as they have been in the past, so that's unlikely to be an issue.

Is there a concrete scenario for step 5? I'd be concerned about doubly mirrored data being non-authoritative and lagging. Could/should such mirrors just run against ClearlyDefined itself?

The scenario for step 5 is a case where I use the data in an air-gaped, isolated environment with no internet access, so by definition I cannot call ClearlyDefined or anything from there. The process is therefore:

maintain a mirror in a place that has ClearlyDefined access
there run the CLI utility to export/backup data mirrored since some date
use sneakernet or else to copy the exported/backup data to your private network
there run the CLI utility to import/load that data chunk

What is the relationship between the API for #4 and that of ClearlyDefined? I'm concerned about bifurcating the space and ending up with integrations (e.g., ORT, GitHub, ...) that work against one API and not the other or forcing integrators (and potentially users) to do extra work.

It is completely unrelated yet a highly similar and simplified version but could easily be made to have the same semantics. One important point is that there is no mongodb in the loop at all, which has been an issue license-wise.

Do you see concrete use cases for mirroring the harvested data? Most scenarios we've encountered really just need the definitions.

I sure do and I have mirrored them all. I have a GSoC student that is looking at the data to help spots ScanCode license detection inconsistencies using stats and machine learning. All the cases where you want to to trust but verify would need them too, as well when there are curation that require more digging and getting back to actual ScanCode scans (say if you report and/or curate a license as OTHER then the underlying data is needed)

Would love to know more about the Cloudflare issues. That feels like a separate topic. If Cloudflare is causing problems, they should be fixed or Cloudflare eliminated. People should not *have to set up a mirror to get reliability.

I agree that's a separate topic. I have no idea what the core issue may be, I just happen to have traced Cloudflare as the cause of hiccups in sustained API calls. It could just be that any glitch looks like a Cloudflare issue since they are fronting everything? That's minor anyway now that I have a complete a base seed DB mirror and that I only increments.

pombredanne commented 4 years ago

@romanlab FYI, I just merged https://github.com/nexB/clearcode-toolkit/pull/20 that added the ability to mirror only definitions as we discussed during the call. Feel free to reach out for help as needed.

zacchiro commented 4 years ago

Seeding a new mirror (step 1) would likely be better as some sort of bulk operation leveraging something in Azure blob. I don't recall the details abut 10s of millions of individual file downloads will be painful

We briefly discussed during and after the call how to go about this. Various interested parties (including Software Heritage) can easily offer storage space for the seed DB, but the problem is supporting egress traffic for downstream mirrors starting from scratch. Torrent is also a possibility, but it would still require a decent amount of seeders participating.

Do you see concrete use cases for mirroring the harvested data? Most scenarios we've encountered really just need the definitions.

I agree with both Philippe's points about the existing of such use cases, and with you that mirroring only the definitions will probably be a much more common use case. @pombredanne: given the new ability to mirror only definitions, would it also be possible to seed an initial mirror whose main aim is to mirror only definitions? If so, how much will that change the volume of the initial seed DB to be hosted? It might be worth having the two different kinds of initial seed databases if the demand of use cases is very different (as I think it is).

pombredanne commented 4 years ago

@zacchiro re:

given the new ability to mirror only definitions, would it also be possible to seed an initial mirror whose main aim is to mirror only definitions? If so, how much will that change the volume of the initial seed DB to be hosted? It might be worth having the two different kinds of initial seed databases if the demand of use cases is very different (as I think it is).

my hunch is that this should be a few 10's of GB, therefore much easier to handle indeed. @majurg since you own the DB it would be interesting to get a rough cut estimate of what a dump of definitions only would be. As a side note, I am not sure that it would be easy to make a selective dump of a PostgreSQL table slice BUT it will be easy (even if a tad slow at first) to dump the loadable JSON using the DB-agnostic dump-load utility that @majurg wrote https://github.com/nexB/clearcode-toolkit/tree/master/etc/scripts (we would need to add a filter for definitions only)

pombredanne commented 4 years ago

@zacchiro re:

Seeding a new mirror (step 1) would likely be better as some sort of bulk operation leveraging something in Azure blob. I don't recall the details abut 10s of millions of individual file downloads will be painful

We briefly discussed during and after the call how to go about this. Various interested parties (including Software Heritage) can easily offer storage space for the seed DB, but the problem is supporting egress traffic for downstream mirrors starting from scratch. Torrent is also a possibility, but it would still require a decent amount of seeders participating.

To your earlier point 10's of GB as an order of magnitude becomes much simpler and I could (relatively) straightfowardly open up a server for reliable resumable rsync fetching of the seed data (possibly split in chunks, but that's JSON so no biggie). We use dedicated hardware with no volume cap so egress should not be a major issue, only the base HW lease costs is :) That could also be permanent torrent seeds.

steven-esser commented 4 years ago

@pombredanne @zacchiro

>>> sum(len(cd.content) for cd in CDitem.objects.definitions().iterator())
104262769785

If that number is in bytes, then its ~105 GB for all the definitions (in compressed form)

pombredanne commented 4 years ago

@zacchiro that makes it a much smaller data set alright, roughly 10 times smaller and much simpler to seed alright, including doing increments! Let's do it !

@majurg I wonder what the most effective way to craft selective dumps then:

PostgreSQL COPY? https://stackoverflow.com/a/1746215/302521
or instead using the API and CLI utility including some update to split that in chunks?

steven-esser commented 4 years ago

I believe the REST API could be modified to do what you describe to handle increments @pombredanne. The mirrors could even update via http if desired.

A PostgeSQL COPY script would probably work as well. In my opinion, it depends on the use-case and hardware setup

pombredanne commented 4 years ago

@majurg Thanks I am bringing this topic in the agenda again for te July 6th 8:00AM PDT call on Discord.

pombredanne commented 4 years ago

I would like to move swiftly enough here, so unless there is an objection in the next few days, my plan would be to:

craft and release a base data dump for definitions for public consumption
promote ClearCode as the solution to this ticket
closed this ticket as solved

pombredanne commented 4 years ago

@jeffmcaffer re again:

If yes, i'd very very much like to not introduce more tech stack. That just limits the people who can work on the system and means that code and structures cannot be shared and when we change something we have to work in two very different tech stacks.

Actually that's even a non issue IMHO since ScanCode is already part of the overall stack and requires Python.

sdavtaker commented 1 month ago

Hi, I read the issue's comments thread and it seems there were large advancements in this project, but it became stale at some point around 2-3 years ago. Is what is missing tracked somewhere else?

clearlydefined / service

Need way to mirror definitions #386