alex9311 / information-retrieval

TU Delft, Masters Software Technology, Information Retrieval, 3rd Quarter 2015

CrowdFlower #10

Closed millenniumproof closed 9 years ago

millenniumproof commented 9 years ago

I was thinking maybe we should also use CrowdFlower for content generation. We need a sizable amount of data before we can perform any significant Information Retrieval techniques on it. The more data we have the better we can test our platform. The job could be something simple like:

Write a one sentence description of a Mobile Application, existing or non-existing.

We could launch the job on Monday and get a good idea of how CrowdFlower returns results.

GizKockesen commented 9 years ago

Yeah that's a good idea, that way we'll be testing crowdflower as well :)

PetervB commented 9 years ago

Yeah, sounds good!

millenniumproof commented 9 years ago

Apparently the CrowdFlower part is more important than we expected, so I'll put more time into making it dynamic. I'll try working out the cURL requests again and make a script that runs periodically, sending new data to CrowdFlower. Gizem, if you want, you can look into getting the results dynamically using webhooks; see https://success.crowdflower.com/hc/en-us/articles/201856249-CrowdFlower-API-Webhooks for how they work. You can use the PHP function json_decode() to parse the JSON data.
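
A minimal sketch of the receiving side, assuming CrowdFlower posts a form field named `payload` that contains JSON, plus a `signal` field naming the event. Those field names, the sample payload, and the helper `parse_webhook` are assumptions for illustration and not taken from this thread, so check them against the webhook article before relying on them.

```php
<?php
// Hypothetical helper: decode the JSON carried in a webhook POST.
// Assumes the form fields are called 'signal' and 'payload'.
function parse_webhook(array $post): ?array
{
    if (!isset($post['payload'])) {
        return null; // nothing to decode
    }
    // true => associative arrays instead of stdClass objects
    $payload = json_decode($post['payload'], true);
    if (!is_array($payload)) {
        return null; // malformed or non-object JSON
    }
    return [
        'signal'  => $post['signal'] ?? '',
        'payload' => $payload,
    ];
}

// In a real endpoint this would be called as parse_webhook($_POST).
$example = parse_webhook([
    'signal'  => 'unit_complete',
    'payload' => '{"data": {"id": 42}}',
]);
echo $example['payload']['data']['id'], "\n";
```

Returning null for a missing or malformed payload lets the endpoint reject bad requests before anything touches the database.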

GizKockesen commented 9 years ago

I will check it out right away :)


GizKockesen commented 9 years ago

Okay I had a muuuuuch harder time understanding things than expected unfortunately :( Crowdflower seems to be just another program that's not well documented. So far, I have this php code that inserts the data from the JSON file (from Crowdflower's job results) to the database. I realize that you probably have this already Miriam, but I had to do it myself too to understand what was going on and what I was doing :S Now I have to figure out how to get the JSON file dynamically :S

I didn't want to put up the php file to github since it's far from being ready so I will just paste it here:

<?php
$servername = "localhost";
$username = "root";
$password = "root";
$dbname = "Sparked";

// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}

// Read the JSON file contents
$jsondata = file_get_contents('job_702403.json');

// Convert JSON object to PHP associative array
$data = json_decode($jsondata, true);

// Get the results
$id_idea = $data['data']['id'];
$accepted = $data['results']['judgments']['data']['should_this_idea_be_rejected'];
$reason = $data['results']['judgments']['data']['why_should_this_idea_be_rejected'];

// Insert into the table; a prepared statement keeps the values out of the SQL string,
// and the backticks are needed because the column name contains a '?'
$stmt = $conn->prepare("INSERT INTO Screening_result (id_Idea, `accepted?`, reason) VALUES (?, ?, ?)");
$stmt->bind_param("sss", $id_idea, $accepted, $reason);
$stmt->execute();
$stmt->close();

$conn->close();
?>

millenniumproof commented 9 years ago

The script I wrote parses the CSV file of the results. It's in the file 'send_ideas_from_csv_to_mysql.php', but to do it dynamically you have to parse from JSON. It's all new to me as well. We'll just try to figure it out as we go :) I'll look into the webhooks as well, so we can check each other's work.

One thing I notice looking at the code is that the data field in the database for 'accepted' is not text. So you have to change a 'Yes' to 1 and a 'No' to 0. You can look at my code.
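
A small sketch of that conversion, assuming the answers arrive as the strings 'Yes' and 'No'. The helper name `answer_to_flag` is hypothetical, and whether 'Yes' should map to 1 depends on how the question is phrased; this follows the mapping described above.

```php
<?php
// Hypothetical helper: map a 'Yes'/'No' answer to the 1/0 the column expects.
// Returns null for anything else so unexpected input is not stored silently.
function answer_to_flag(string $answer): ?int
{
    switch (strtolower(trim($answer))) {
        case 'yes':
            return 1;
        case 'no':
            return 0;
        default:
            return null;
    }
}

var_dump(answer_to_flag('Yes'));
var_dump(answer_to_flag('No'));
```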

GizKockesen commented 9 years ago

Oh okay thanks!


millenniumproof commented 9 years ago

I got the dynamic sending of data to CrowdFlower working. One reason it didn't work at first was that the URL given in the CrowdFlower API documentation didn't work. When I changed the 'https' to 'http' it worked fine; I just had to fix the formatting a bit. I didn't have any problems sending data to a running job. We should decide whether we send and receive data in batches or row by row. I'll first try to get the webhook for receiving the results working.

alex9311 commented 9 years ago

sounds great! Would it be easiest to send the data row by row? Specifically, we would send it when a user submits a new question. Does that sound right?

Are you sending it through php?

millenniumproof commented 9 years ago

Yeah, it's a php script that sends a cURL request. It's pretty simple.
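
A rough sketch of what such a script could look like, not the actual code from the repository. The `/v1/jobs/{id}/units.json` path and the `unit[data][...]` field layout are assumptions about the CrowdFlower API, the job ID and key are placeholders, and both function names are hypothetical. It uses 'http' because that is what worked in this thread.

```php
<?php
// Hypothetical helper: build the URL and form body for adding one row to a job.
// The path and the unit[data][...] layout are assumptions about the API.
function build_unit_request(string $jobId, string $apiKey, array $entry): array
{
    $url = 'http://api.crowdflower.com/v1/jobs/' . rawurlencode($jobId)
         . '/units.json?key=' . rawurlencode($apiKey);
    // http_build_query turns nested arrays into unit[data][title]=... pairs
    $body = http_build_query(['unit' => ['data' => $entry]]);
    return ['url' => $url, 'body' => $body];
}

// Send the request with PHP's cURL extension; returns the response body or false.
function send_unit(array $request)
{
    $ch = curl_init($request['url']);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $request['body']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // fail fast instead of hanging
    $response = curl_exec($ch);
    if ($response === false) {
        error_log('cURL error: ' . curl_error($ch)); // surface the reason for the failure
    }
    return $response;
}

// Build (but do not send) a request for one row.
$request = build_unit_request('12345', 'PLACEHOLDER_KEY', ['id' => 7, 'title' => 'Bird app']);
echo $request['url'], "\n", $request['body'], "\n";
```

Keeping the request building separate from the sending makes it possible to inspect the URL and body without contacting CrowdFlower.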

Sending row by row seems best to me as well. If you send periodically you would have to wait till you get the results so you wouldn't send the same data twice. Or you would have to adjust the database to make note if data has been sent, but that seems a bit cumbersome.

If we want to send row by row I would need the idea_ID, which is (I think?) generated in the database. Any idea how to get the ID back when you insert a new idea in the database?

alex9311 commented 9 years ago

Yep! I just committed some fixes that include getting the ID of an idea that was just inserted. See here on lines 68-69

millenniumproof commented 9 years ago

Great, I'll add the function to the 'submit_functions' script then.

millenniumproof commented 9 years ago

Committed straight to master again, haha. I promise, I didn't break anything.

alex9311 commented 9 years ago

no worries, I've been doing the same thing haha

GizKockesen commented 9 years ago

Oh nicee!! I'm sorry I couldn't be of much help in figuring out how to get the results dynamically :(

millenniumproof commented 9 years ago

No problem. The webcrawler stuff is awesome!

millenniumproof commented 9 years ago

For some reason the code for sending data to crowdflower isn't working anymore. I haven't figured out what the problem is yet.

alex9311 commented 9 years ago

Was it working on the server before?

millenniumproof commented 9 years ago

I'm running the code that I used for testing before. I didn't change anything. The same code that worked before doesn't work anymore. Not getting any error, seems like it times out when trying to connect to CrowdFlower. :S

alex9311 commented 9 years ago

Yeah, but were you testing on the server before? The reason I ask is that it might be a port issue with the server. But if it was working on the server before, we know it's not that.

millenniumproof commented 9 years ago

Hmm, I thought I did test it on the server, but maybe not. I just tried running from local server and it works there. So maybe it is a port issue with the server.

millenniumproof commented 9 years ago

Someone on StackOverflow mentioned the problem might be related to SSL, but no solution was mentioned: "This problem is probably specific for a site and a specific SSL backend. Your Windows version uses OpenSSL while the Linux version uses NSS as the backend." http://stackoverflow.com/questions/28358557/curl-works-on-windows-server-but-not-on-linux-is-ssl-to-blame

millenniumproof commented 9 years ago

The curl error I get is: Failed to connect to api.crowdflower.com port 80: Connection timed out

alex9311 commented 9 years ago

Ok, I just opened up that port on the outbound side; it was already open on inbound. Let me know! (I think it's working for me now)

millenniumproof commented 9 years ago

d (^-^) b

\ (^-^) / \ (^-^) / \ (^-^) / \ (^-^) / \ (^-^) /

alex9311 commented 9 years ago

awesome!

millenniumproof commented 9 years ago

Getting the results from CrowdFlower is working now. This is the sql to get ideas that have been accepted by crowdflower: $sql = "SELECT * FROM Idea WHERE id IN (SELECT id_idea FROM Screening_results_crowdflower WHERE accepted = 1)";

What is the full url for the images, I need it to show the images correctly on CrowdFlower.

alex9311 commented 9 years ago

In the database the image url is stored as /var/www/html/uploads/Weaver_bird.jpg. Since /var/www/html is the path to the web directory, the url for the image is http://54.93.120.201/uploads/Weaver_bird.jpg. Do you want me to implement the conversion to that URL?

alex9311 commented 9 years ago

in php it should be something like

$image_from_db = "/var/www/html/uploads/Weaver_bird.jpg";
$image_to_crowdflower = "http://54.93.120.201/" . substr($image_from_db, 14);

I haven't tested that but it should be close

millenniumproof commented 9 years ago

The CrowdFlower job uses CML (CrowdFlower Markup Language) similar to HTML. I don't think there's a substring function. I think the best thing would be to save just the end part "Weaver_bird.jpg" in the database. I can then fill in the rest of the url in CrowdFlower.

alex9311 commented 9 years ago

Can't you just edit the entry in the code before you send it? In this snippet:

        $entry = array(  "created" => "" , "favorite_count" => "" , "id" => $new_idea_id , 
                        "image" => $clean_image_location , "lang" => "" , "retweet_count" => "" , 
                        "text" => "" , "text_description" => $clean_valueText , 
                        "title" => $clean_title ,"user" => "" );

Just switch $clean_image_location with a modified value? Or am I thinking about this wrong?

millenniumproof commented 9 years ago

Yeah, that would work. Just thought it would be good practice to have cleaner data. I'll change it later.

alex9311 commented 9 years ago

I think either way could be argued to be poor practice. Having full raw URLs to your own server means the database isn't portable to a new host. I'd say it's bad practice to have raw IPs to yourself encoded anywhere. You make a good point about trying to use good practice though.

How about instead of hardcoding the IP in before sending it to crowdflower we dynamically get it? That way the app and DB could theoretically move to a new host. We can use the PHP variable $_SERVER['HTTP_HOST'] for that.

So something like:

$image_from_db = "/var/www/html/uploads/Weaver_bird.jpg";
$image_to_crowdflower = "http://" . $_SERVER['HTTP_HOST'] . "/" . substr($image_from_db, 14);

millenniumproof commented 9 years ago

The full raw URL would only be in one place, the Crowdflower job, and not actually in any data.

But this is good, I'll change it now.
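
One way the change could look, as a sketch rather than the code that was actually committed. The helper name `image_path_to_url` is hypothetical, and the web root default is taken from the path mentioned earlier in the thread. It strips the web root by checking the prefix instead of relying on a fixed offset.

```php
<?php
// Hypothetical helper: turn a stored filesystem path into a public URL.
// $host and $webRoot are parameters so nothing is hardcoded to one server.
function image_path_to_url(string $path, string $host, string $webRoot = '/var/www/html'): ?string
{
    $prefix = rtrim($webRoot, '/') . '/';
    // Only files inside the web root can be served
    if (strpos($path, $prefix) !== 0) {
        return null;
    }
    $relative = substr($path, strlen($prefix));
    return 'http://' . $host . '/' . $relative;
}

echo image_path_to_url('/var/www/html/uploads/Weaver_bird.jpg', '54.93.120.201'), "\n";
```

Inside a web request, `$_SERVER['HTTP_HOST']` would be passed as the host so the result follows the app to a new server.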

alex9311 commented 9 years ago

@millenniumproof It seems to me that crowdflower is working now, should I close this issue?