BIDS-projects / scraper

Collects data from websites of data science institutions

Use proper memory storage for Scrapy to prevent memory error #6

Closed by don-han 8 years ago

don-han commented 8 years ago

Discussions to consider:

alvinwan commented 8 years ago

Copying my comment from Slack:

The thing is that document-based NoSQL databases like MongoDB don't handle relationships between entities very well. Standard JOIN queries are ugly, and filters across related entities are sometimes impossible or computationally expensive. If this becomes an issue, though, we can always migrate the data from MongoDB to MySQL.

don-han commented 8 years ago

I'm not sure what our schema will look like at this point, and as far as I know an RDBMS requires a fairly rigid schema from the beginning. The only things we definitely need to keep track of are the text and the URL it belongs to, but that's about it. Also, since our project can change direction depending on the results, I think using MongoDB would make the process more flexible: if we do need to make changes, we wouldn't have to redesign our whole system.

I'm not sure how easy the transition from MongoDB to MySQL is, but if it's easy enough, I think we might want to start with MongoDB, at least for the text collection part.
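As a rough illustration of the idea above, here is a minimal sketch of a Scrapy-style item pipeline that writes each scraped page to MongoDB as it arrives instead of accumulating results in memory. The class name `MongoTextPipeline`, the connection URI, and the database and collection names are assumptions made for this example, not part of the project. Only `url` and `text` are required; any extra fields are stored as-is, which is the flexibility being argued for. In a real Scrapy project one would likely raise `scrapy.exceptions.DropItem` rather than `ValueError`.

```python
class MongoTextPipeline:
    """Sketch of a Scrapy item pipeline that stores pages in MongoDB.

    All names and defaults here are hypothetical placeholders.
    """

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="scraper", collection_name="pages"):
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported lazily so the class can be defined without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Copy the item so the driver's added "_id" does not leak back into it.
        doc = dict(item)
        # Only the URL and its text are mandatory; other fields pass through.
        if not doc.get("url") or not doc.get("text"):
            raise ValueError("item must contain non-empty 'url' and 'text'")
        self.collection.insert_one(doc)
        return item
```

Assuming the class lives in a module such as `scraper/pipelines.py` (a hypothetical path), it would be enabled through the project's `ITEM_PIPELINES` setting.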

alvinwan commented 8 years ago

The migration is doable; let's use MongoDB then!
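For reference, a minimal sketch of what such a migration could look like. It assumes the documents carry only the `url` and `text` fields discussed above, and it uses Python's built-in `sqlite3` as a stand-in for MySQL so the example runs anywhere; the `pages` table and the `migrate` helper are hypothetical names. With a real MongoDB collection, `documents` would come from something like `collection.find()`.

```python
import sqlite3


def migrate(documents, conn):
    """Copy the url/text fields of each document into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "id INTEGER PRIMARY KEY, url TEXT NOT NULL, text TEXT NOT NULL)"
    )
    # Fields other than url and text are ignored in this sketch.
    rows = [(doc["url"], doc["text"]) for doc in documents]
    conn.executemany("INSERT INTO pages (url, text) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Example documents shaped like the ones the scraper would store.
docs = [
    {"url": "https://example.org/a", "text": "first page", "lang": "en"},
    {"url": "https://example.org/b", "text": "second page"},
]
conn = sqlite3.connect(":memory:")
migrated = migrate(docs, conn)
```

Fields that only some documents have (such as `lang` here) would need their own nullable columns or a separate table, which is where the relational schema decisions come back in.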