Access to Cloud Storage

Sandr0x00 commented 7 years ago

How will we access the cloud storage? Should the API be a wrapper around the cloud storage? Instead of directly querying the cloud server, you query the API which then in turn queries the cloud storage and return is as .json.

Summoning @MusicConnectionMachine/group-2 @MusicConnectionMachine/group-3 @MusicConnectionMachine/group-4

felixschorer commented 7 years ago

I think it'll come down to preference. Direct access might be a little faster and on the other hand accessing it via the API would give us the option to restrict cloud access for API users. We just have to know in order to implement it.

sacdallago commented 7 years ago

I don't get this? Direct access to cloud storage? As in the binaries that postgres stores for the database?

According to what I understood (and the way we thought this) Everything will go through the API! Even groups providing the data will POST the data via API. Hope this answers the question.

edit: If your rationale is: there are going to be some .json files containing data... That will not happen. FS is not a database, fs is.. slow! And FS! :D Whatever data is, it goes into a database. IF it's a JSON, it will get into postgres' JSON fields, or you can set up another database (mongo) to treat as a write-once/read-many JSONs binary dump. --> No binary .json files stored on disk <--

kordianbruck commented 7 years ago

@sacdallago we cannot save all the resulting data from the common crawl parsing in Postgre - can we?

From my understanding today, we concluded, that teams 1 & 2 will have direct Database access. All other groups should be using the API. The real question with REST APIs is: can we make it fast enough to request ALL data. Its rather crazy to have more than 100MBs send DB -> API -> HTTP -> Server. This question hasn't been answered yet, if there is a better way to request large payloads.

Sandr0x00 commented 7 years ago

Group 1 and 2 have direct access? I thought we discussed today, that everything (like @sacdallago said) will go through the API. If they have direct access, I probably should change that in the overview .... No Cloud Storage? Do we store everything in the DB now? Can we please just agree on something and leave it that way? My idea with the whiteboard (today in the meeting) and the overview (#16) was to put these things clearly visible for everybody in one place and not scattered over 10 issues. In a simplified way, it's just an overview, that other groups can work with it and do not have to track every issue from every other group.

nbasargin commented 7 years ago

@sacdallago your comment has merge conflicts with today's meeting results :P

About data: A few numbers about CommonCrawl (if we take only the newest crawl an only the WET files): 57800 files * ~400 MB / file (unpacked) = ~23 TB (all data, worst case) Let's assume we filter away about 99.9% useless data. Then we are left with ~23 GB. This is quite a lot. PostgreSQL can handle it (according to numbers on the Internet) but I agree with @kordianbruck : All those layers of abstraction would slow everything down. The good thing: storing everything in the DB would simplify the project. No one would need to care about getting data from the cloud. So, what is preferred?

About decisions: Right now, the students (at least me) are a bit confused. Could the tutors please agree on a few basic points?

where do we store tons of unstructured data: in the DB OR in the cloud?
how do all groups access the DB: solely through API OR do some groups have direct access to the DB? I just want to prevent situations where a lot of work is done and has to be dropped because the requirements have changed. :)

vviro commented 7 years ago

@nyxathid

regarding the place to store the unstructured data: it all comes down to performance. What are the access patterns be? Would that be multiple processes accessing every file for processing separately? What's the overhead of fetching a file from the blob store on Azure? What's the overhead of fetching a file from the database? How many files would need to be fetched per second in order for the data to be processed in several days? Can a database handle the read load? I don't think that 23 GB is too large for Postgres, but the other questions seem to be more important here. Also, there are reasons why CC uses WET files instead of storing each web page separately. The database is something in between - it bundles the files on disk, so the file system doesn't have to deal with millions of entries, but on the other hand introduces overhead and complexity beyond the file system. A simple batch approach as used in WET files would also be an option. Why not actually write our own WET files with only the selected web pages? The database would store the pointers to the WET files and the offsets in them, and the processors could load a single WET file at a time and be busy with it for a while.
does the question refer exactly to accessing the database, or do you actually mean accessing the data? The API would be necessary for the web application part (groups 5 and 6). It seems to me that the groups 3 and 4 could benefit from having access to all the data at once. Be it in form of own WET files, or a database snapshot, or the files on the blob storage. I'm not sure what the problem with the direct read access to the DB would be, If the team providing the data provides a description of the data structure and format and defers the access part to the standard database API, this should be fine. In my opinion there is no need to put up an additional translation layer between the DB and the consumer for its own sake. If there are specific reasons, like interface stability, versioning, security, or whatever else, they need to be discussed and addressed, but the at least at the moment it seems to me that a documented data structure along with a reference to the DB API could make a sufficient API for the purposes of groups 3 and 4. And if the interface stability, probably the most problematic reason from the list, is indeed a concern, then why not agree on a format to dump all the data into, so that the groups providing the DB would still have a freedom to change the internal data structure while maintaining the interface stability for the consumers who need the bulk data access?

kordianbruck commented 7 years ago

So here is the executive decision we tutors came up with after some long discussions:

We are going to store WET files on "the cloud" / file system, not in the DB. Reasoning is that those files are only used in bulk and do not need to be indexed for full-text search. So Group 2 will have WET as input (the CommonCrawl dataset) and WET (the WET files containing only the filtered pages with relevant entities) and DB (the metadata of the filtered pages in the output WET files, e.g. the location of the filtered pages within the files, their original domains, urls, etc) as output while having write access. 3/4 will have read access to those WET files.
The API shall only be provided for the frontend teams 5&6 - all other teams can directly interface with the database and the blob storage in order to read data

kordianbruck commented 7 years ago

To conclude: We are going to use Azure Blobs to store any WET files - https://azure.microsoft.com/en-us/services/storage/blobs/

@Sandr00 any more questions or can we close this issue? This is not part of the API. Please discuss with @MusicConnectionMachine/group-2 how they want to do Azure Blobs

Sandr0x00 commented 7 years ago

We can close it and discuss it there.

MusicConnectionMachine / api

Access to Cloud Storage #17