ICT4SD / Science_Technology_Search

Build a searchable collection of science and technology knowledge useful to implement the Sustainable Development Goals.
https://ict4sd.github.io/2016/09/21/Project_ST_SEARCH/
GNU General Public License v3.0

Progress Beethoven Team #2

Open unite-analytics opened 7 years ago

unite-analytics commented 7 years ago

Hello team Beethoven @ICT4SD/st-search-members,

I am encouraged to see that you have learned how to use Docker and to install local versions of Spark and Elasticsearch to fetch and index data from the Common Crawl.

Just with that knowledge you could already build an initial version of the search engine by crawling only necessary domains, and perhaps by excluding pages which do not contain at least a few of the keywords relating to the SDGs. Nice job!
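As a rough illustration of that filtering idea, here is a minimal Python sketch. It keeps a page only if its URL belongs to an allowed domain and its text contains at least a few distinct SDG-related keywords. The keyword list, the `min_hits` threshold, and the function names are assumptions made for this example, not part of the project; a real keyword list would be derived from the SDG texts themselves.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list, for illustration only.
SDG_KEYWORDS = [
    "poverty", "clean water", "sanitation",
    "renewable energy", "climate change", "biodiversity",
]


def count_sdg_keywords(text, keywords=SDG_KEYWORDS):
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(
        1
        for kw in keywords
        if re.search(r"\b" + re.escape(kw.lower()) + r"\b", lowered)
    )


def in_allowed_domain(url, allowed_domains):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_pages(pages, allowed_domains, min_hits=2):
    """Yield (url, text) pairs that pass both the domain and keyword checks."""
    for url, text in pages:
        if in_allowed_domain(url, allowed_domains) and \
                count_sdg_keywords(text) >= min_hits:
            yield url, text
```

The same two checks could be applied to records read from the Common Crawl before they are sent to Elasticsearch, which would keep the index small enough for a first prototype.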

Also, I noticed that some people have started to look into Kibana, which you can very quickly set up as the front end for searching. If you point Kibana at the Elasticsearch indexes you already have, you will have a complete working solution!

Also, if you can communicate your storage and server requirements to Professor RP, he might be able to help you with resources in Amazon Web Services. Would you please discuss this with RP?

I wanted to ask if you could take a little bit of time to share the items listed below on GitHub so other team members could also help:

These items would be very helpful for Professor Roberto Gonzales and Carolina Vasquez at the University of Santiago in Chile, who are working on a customized search front-end interface for this project. (They are in this GitHub team. Please feel free to interact with them directly.)

Finally, at the United Nations in New York, May 15 and 16, 2017 there will be a conference called the Science Technology and Innovation (STI) Forum. At this event there will be discussions about an online platform for STI. I hope from this project we can have a few prototypes of search engines which can be part of that online platform. See more details: https://sustainabledevelopment.un.org/TFM

I look forward to hearing about your next steps!

Best regards, Jorge

lli130 commented 7 years ago

@unite-analytics We will share the related materials asap.

lli130 commented 7 years ago

@unite-analytics We have uploaded a PowerPoint with a sample image of the search engine. We are happy to hear any comments.

lli130 commented 7 years ago

Hi @sebastian-nagel, the team successfully followed the tutorial to launch Spark instances by using pipelines. However, we have run into problems doing so in the last few days. The error says that the max spot instance count was exceeded. After trying different accounts, we think the problem may be caused by the AMI (ami-383baf2f), which belongs to commonsearch. Thanks for any hints or suggestions.

sebastian-nagel commented 7 years ago

Hi, it sounds more like you need to increase your limit, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-limits.html#spot-limits-general

YunyanWu commented 7 years ago

Hi @sebastian-nagel, thanks for your help!

Our team has tried three other new AWS accounts, which should not have reached the limit, but we still got the same error. We also submitted a limit increase request to Amazon from one of our accounts, but Amazon declined it. We are now going to try again.

Do you think this may be caused by other issues not related to the account limit? Or do we need to build our own AMIs to continue? It seems that specific AMIs and a VPC are used in the cosr-ops repository to configure AWS, as shown at the following link: https://github.com/commonsearch/cosr-ops/tree/master/aws/cloudformation

Thanks for any hints or suggestions!

sebastian-nagel commented 7 years ago

Afaik, the limits may be quite low for new accounts (even fewer than 20 spot instances) until they have been increased by AWS, either automatically over time or on request. Note that there are global limits and limits per instance type, see http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-resource-limits.html. I don't think there is a different reason; just make sure that the limits, instance type and region fit.
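The check described above (global limit, per-instance-type limit, and region all have to fit) can be sketched in a few lines of Python before submitting a spot request. This is only an illustration: the `SpotLimits` class, the function name, and all numbers are hypothetical, and the real limits have to be read from the AWS console or the limits documentation for your own account. It does not call any AWS API.

```python
from dataclasses import dataclass, field


@dataclass
class SpotLimits:
    """Spot limits for one account in one region (values are assumptions)."""
    region: str
    global_max: int                      # total spot instances allowed
    per_type_max: dict = field(default_factory=dict)  # instance type -> max


def check_spot_request(limits, region, instance_type, count, already_running=0):
    """Return a list of reasons why the request would exceed the limits."""
    problems = []
    if region != limits.region:
        problems.append(
            f"limits are for {limits.region}, but the request targets {region}"
        )
    if already_running + count > limits.global_max:
        problems.append(
            f"{already_running + count} instances exceed the global "
            f"limit of {limits.global_max}"
        )
    type_max = limits.per_type_max.get(instance_type)
    if type_max is not None and count > type_max:
        problems.append(
            f"{count} x {instance_type} exceeds the per-type limit of {type_max}"
        )
    return problems
```

An empty list means the planned request fits the limits you entered. A non-empty list names which of the three conditions fails, which helps tell a low per-type limit apart from a region mismatch.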