alex9311 / information-retrieval

TU Delft, Masters Software Technology, Information Retrieval, 3rd Quarter 2015

Web Crawler #48

Closed alex9311 closed 9 years ago

alex9311 commented 9 years ago

We need to create a web crawler that goes through the App Store or Play Store and collects data on all existing mobile applications.

millenniumproof commented 9 years ago

I'll have a crack at making a web crawler as well when I'm done with the CrowdFlower stuff. We can compare results and discuss our progress here.

GizKockesen commented 9 years ago

Could this help? http://blog.singhanuvrat.com/tech/crawl-itunes-appstore-to-get-list-of-all-apps

alex9311 commented 9 years ago

Yes, that will definitely help! I can take a look into this in the next few days.

GizKockesen commented 9 years ago

Cool! I can try it out as well, it looks like I can get it to work. Then I will have finally done something useful for this project! :P

alex9311 commented 9 years ago

Sounds good. I glanced over the code and didn't see anything that grabs a "description" field, though I may have missed it. That will be important to add if it's not there.

alex9311 commented 9 years ago

Have you had any luck with this? I got it running and ran it overnight, but it errored out and I can't get the data out of the output.

GizKockesen commented 9 years ago

Yeah it's been working fine but now all of a sudden it gives an error and I know there is nothing wrong with the code. I'm gonna try it later again, maybe something is wrong with the servers or something.

GizKockesen commented 9 years ago

Did you manage to get it to work again? I've been trying and it wasn't working. Then I changed the dump file and it started to work again :) I still need to figure out how to print out the app details and not just the URLs, though. I can save the data to a CSV file, but so far I haven't managed to save anything about the apps other than their URLs. I'm gonna keep working on it :)
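
For reference, a minimal sketch of saving more than the URL for each app to a CSV file, using Python's standard `csv` module. This is not the crawler's actual code: the field names (`name`, `url`, `description`) and the helper names are assumptions made up for illustration, and the real crawler may store its results differently.

```python
import csv
import io

# Hypothetical field names; the crawler in this thread may use different ones.
FIELDS = ["name", "url", "description"]


def write_apps_csv(apps, out):
    """Write a header row plus one row per app dict to the open text file `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for app in apps:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({field: app.get(field, "") for field in FIELDS})


def apps_to_csv_text(apps):
    """Return the CSV output as a string (handy for quick checks)."""
    buffer = io.StringIO(newline="")
    write_apps_csv(apps, buffer)
    return buffer.getvalue()
```

To write to disk instead, open the file with `open("apps.csv", "w", newline="", encoding="utf-8")` and pass it to `write_apps_csv`. The `csv` module takes care of quoting, so commas and line breaks inside a description do not break the columns.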

alex9311 commented 9 years ago

No, I haven't yet. Does it run super slowly on your machine as well? It takes hours for mine to get through all the apps. Let me know if you make progress! I've been looking into possible alternatives that can grab the description as well.

GizKockesen commented 9 years ago

Yeah, I think there are too many apps :S

HDking commented 9 years ago

Is it possible to scope it down for the demo? For example, we could take just the top 100 of [any topic] and limit our test set to that topic as well.

alex9311 commented 9 years ago

Yeah, I'm sure that's what we'll end up doing.

GizKockesen commented 9 years ago

I'm trying out another crawler which seems to work better. You can also set a limit on how many apps per category you want to get. It retrieves all the info except the description, so I'm trying to add that to the code as well.
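
As a rough illustration of what a per-category limit does (this is not the code of the crawler mentioned above; the `category` key and the function name are hypothetical), the cap can be applied while iterating over the collected apps:

```python
from collections import defaultdict


def limit_per_category(apps, limit):
    """Keep at most `limit` apps per category, preserving the input order."""
    counts = defaultdict(int)  # category -> number of apps kept so far
    kept = []
    for app in apps:
        category = app["category"]  # hypothetical key name
        if counts[category] < limit:
            counts[category] += 1
            kept.append(app)
    return kept
```

If the input is already sorted by rank within each category, calling this with `limit=100` would give the "top 100 per topic" scoping suggested earlier in the thread.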

GizKockesen commented 9 years ago

Did it! I added the lines for retrieving the description of the gathered apps :D I will put the whole thing on GitHub :)

millenniumproof commented 9 years ago

Awesome!

HDking commented 9 years ago

cool!

millenniumproof commented 9 years ago

Now that we have the data from the web crawler, how are we going to use it? Alex, Gizem and I talked about it during lunch today. With 1,000,000+ apps of widely varying quality in the App Store, you don't want to reject an idea just because it is similar to something that is already there. Similarity is almost inevitable.

One idea we had was that when a user wants to submit an idea, we show them some similar apps from the App Store, to encourage them to elaborate on the idea and on what is unique about it. This would be an extra step and thus lower usability, but we didn't have any other good ideas yet. So if anyone has other ideas, they're welcome.
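
One possible way to find the "similar apps" is to compare the idea text with the crawled descriptions using TF-IDF weights and cosine similarity. The sketch below is only an illustration under assumptions, not something the group has agreed on: the `name` and `description` keys, the function names, and the smoothed IDF formula are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def similar_apps(idea, apps, top_n=3):
    """Return up to `top_n` (score, app name) pairs, most similar first.

    Similarity is the cosine between TF-IDF vectors of the idea text and
    each app description. Apps with no shared terms are left out.
    """
    docs = [tokenize(app["description"]) for app in apps]
    n_docs = len(docs)
    # Document frequency: in how many descriptions does each term occur?
    df = Counter(term for doc in docs for term in set(doc))

    def vectorize(tokens):
        tf = Counter(tokens)
        # Smoothed IDF, so terms unseen in the crawl do not divide by zero.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    idea_vec = vectorize(tokenize(idea))
    scored = [(cosine(idea_vec, vectorize(doc)), app["name"])
              for doc, app in zip(docs, apps)]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

Showing only the top few matches keeps the extra step short for the user, and the score could later serve as a threshold for deciding when to show the step at all.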