Closed: millenniumproof closed this issue 9 years ago
Yeah, that's a good idea. That way we'll be testing CrowdFlower as well :)
On 28 Feb 2015 11:37, "millenniumproof" notifications@github.com wrote:
I was thinking maybe we should also use CrowdFlower for content generation. We need a sizable amount of data before we can perform any significant Information Retrieval techniques on it. The more data we have the better we can test our platform. The job could be something simple like:
Write a one sentence description of a Mobile Application, existing or non-existing.
We could launch the job on Monday and get a good idea of how CrowdFlower returns results.
— Reply to this email directly or view it on GitHub https://github.com/alex9311/TUD-Information-Retrieval-Group-02/issues/10 .
Yeah, sounds good!
Apparently the CrowdFlower part is more important than we expected so I'll put more time into making it dynamic. I'll try working out the cURL requests again and make a script that runs periodically sending new data to CrowdFlower. Gizem, if you want you can look into getting the results dynamically using webhooks. https://success.crowdflower.com/hc/en-us/articles/201856249-CrowdFlower-API-Webhooks You can use the php function: json_decode() to parse the JSON data.
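As a rough illustration of the json_decode() suggestion above, here is a minimal sketch of decoding a webhook-style POST body. The field names `signal` and `payload`, the payload layout, and the helper name `parse_webhook` are assumptions made for the example, not something confirmed in this thread; they should be checked against the webhook documentation linked above.

```php
<?php
// Sketch: decode a webhook-style POST body into a PHP array.
// ASSUMPTION: the request carries "signal" and "payload" fields and the
// payload is a JSON string. parse_webhook() is a hypothetical helper.

function parse_webhook(array $post): ?array
{
    if (!isset($post['payload'])) {
        return null; // nothing to parse
    }
    // Passing true returns associative arrays instead of stdClass objects.
    $payload = json_decode($post['payload'], true);
    if (json_last_error() !== JSON_ERROR_NONE || !is_array($payload)) {
        return null; // malformed JSON
    }
    return [
        'signal'  => $post['signal'] ?? '',
        'payload' => $payload,
    ];
}

// Usage with a made-up request body (a real endpoint would pass $_POST).
$example = [
    'signal'  => 'unit_complete',
    'payload' => '{"id": 42, "data": {"title": "Demo idea"}}',
];
$parsed = parse_webhook($example);
echo $parsed['payload']['data']['title'], "\n";
```

Returning null for a missing or malformed payload keeps the caller from inserting half-parsed data into the database.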
I will check it out right away :)
Okay I had a muuuuuch harder time understanding things than expected unfortunately :( Crowdflower seems to be just another program that's not well documented. So far, I have this php code that inserts the data from the JSON file (from Crowdflower's job results) to the database. I realize that you probably have this already Miriam, but I had to do it myself too to understand what was going on and what I was doing :S Now I have to figure out how to get the JSON file dynamically :S
I didn't want to put up the php file to github since it's far from being ready so I will just paste it here:
<?php
$servername = "localhost";
$username = "root";
$password = "root";
$dbname = "Sparked";

// Create connection
$conn = new mysqli($servername, $username, $password, $dbname);
// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}

// Read the JSON file contents
$jsondata = file_get_contents('job_702403.json');

// Convert JSON object to php associative array
$data = json_decode($jsondata, true);

// Get the results
$id_idea = $data['data']['id'];
$accepted = $data['results']['judgments']['data']['should_this_idea_be_rejected'];
$reason = $data['results']['judgments']['data']['why_should_this_idea_be_rejected'];

// Insert into the table
$sql = "INSERT INTO Screening_result(id_Idea, accepted?, reason) VALUES('$id_idea', '$accepted', '$reason')";

$conn->close();
?>
The script I wrote was to parse the csv file of the results. It's in the file 'send_ideas_from_csv_to_mysql.php', but to do it dynamically you have to parse from JSON. It's all new to me as well. We'll just try to figure it out as we go :) I'll look into the webhooks as well. So we can check each other's work.
One thing I notice looking at the code is that the data field in the database for 'accepted' is not text. So you have to change a 'Yes' to 1 and a 'No' to 0. You can look at my code.
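The 'Yes'/'No' conversion mentioned above could look roughly like the sketch below. The helper names `yes_no_to_int` and `insert_result` are hypothetical, and the column name `accepted` (without the question mark) is an assumption about the table; the insert uses a mysqli prepared statement so the values are not interpolated into the SQL string.

```php
<?php
// Sketch: turn a 'Yes'/'No' answer into the 1/0 a numeric column expects.
// yes_no_to_int() and insert_result() are hypothetical helpers.

function yes_no_to_int(string $answer): int
{
    // Compare case-insensitively and ignore surrounding whitespace.
    return strtolower(trim($answer)) === 'yes' ? 1 : 0;
}

// ASSUMPTION: table and column names are simplified from the thread.
function insert_result(mysqli $conn, int $ideaId, string $answer, string $reason): bool
{
    $stmt = $conn->prepare(
        'INSERT INTO Screening_result (id_Idea, accepted, reason) VALUES (?, ?, ?)'
    );
    if ($stmt === false) {
        return false; // the statement could not be prepared
    }
    $accepted = yes_no_to_int($answer);
    // 'iis' = integer, integer, string
    $stmt->bind_param('iis', $ideaId, $accepted, $reason);
    return $stmt->execute();
}

echo yes_no_to_int('Yes'), yes_no_to_int('No'), "\n";
```

Anything other than a 'yes' (in any letter case) maps to 0 here, which is a deliberate simplification.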
Oh okay thanks!
I got the dynamic sending of data to CrowdFlower working. One reason it didn't work at first was that the URL given in the CrowdFlower API docs didn't work. When I changed the 'https' to 'http' it worked fine; I just had to fix the formatting a bit. I didn't have any problems sending data to a running job. We should decide whether we send and receive data in batches or row by row. I'll first try to get the webhook for receiving the results working.
sounds great! Would it be easiest to send the data row by row? Specifically, we would send it when a user submits a new question. Does that sound right?
Are you sending it through php?
Yeah, it's a php script that sends a cURL request. It's pretty simple.
Sending row by row seems best to me as well. If you send periodically you would have to wait till you get the results so you wouldn't send the same data twice. Or you would have to adjust the database to make note if data has been sent, but that seems a bit cumbersome.
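For reference, sending a single row with cURL from PHP might be sketched as follows. This is not the script from the repository: the `/v1` base path, the `jobs/<id>/units.json` endpoint, the `key` query parameter, and the `unit[data][...]` field layout are all assumptions for illustration, and both function names are hypothetical.

```php
<?php
// Sketch: send one row to a running job with a cURL POST.
// ASSUMPTIONS: endpoint path, "key" parameter and unit[data][...] layout
// are illustrative and not confirmed by this thread.

function build_unit_request(string $baseUrl, int $jobId, string $apiKey, array $row): array
{
    $url = rtrim($baseUrl, '/') . "/jobs/$jobId/units.json?key=" . urlencode($apiKey);
    // http_build_query turns the nested array into unit[data][title]=... pairs.
    $body = http_build_query(['unit' => ['data' => $row]]);
    return ['url' => $url, 'body' => $body];
}

function send_unit(array $request, int $timeoutSeconds = 10): array
{
    $ch = curl_init($request['url']);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $request['body']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Fail fast instead of hanging when the host cannot be reached.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    $response = curl_exec($ch);
    $error = $response === false ? curl_error($ch) : null;
    return ['response' => $response, 'error' => $error];
}

// Build (but do not send) a request for a made-up job and row.
$request = build_unit_request('http://api.crowdflower.com/v1', 123, 'API_KEY', [
    'id'    => 7,
    'title' => 'Demo idea',
]);
echo $request['url'], "\n", urldecode($request['body']), "\n";
```

Keeping the request building separate from the network call makes the formatting easy to inspect, and returning the `curl_error()` text surfaces connection problems instead of failing silently.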
If we want to send row by row I would need the idea_ID, which is (I think?) generated in the database. Any idea how to get the ID back when you insert a new idea in the database?
Yep! I just committed some fixes that include getting the ID from an idea that was just inserted. See here on lines 68-69
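The general technique for reading back a generated ID can be sketched like this. To keep the example runnable without a MySQL server it swaps in an in-memory SQLite database through PDO; with mysqli, as used elsewhere in this thread, the equivalent is the `$conn->insert_id` property. The simplified `Idea` table and the `insert_idea` helper are assumptions for the example, not the committed code.

```php
<?php
// Sketch: read back the auto-generated ID right after an INSERT.
// Uses in-memory SQLite via PDO (a stand-in for the MySQL database);
// with mysqli the same value is available as $conn->insert_id.

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// ASSUMPTION: a simplified stand-in for the real Idea table.
$pdo->exec('CREATE TABLE Idea (id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT)');

function insert_idea(PDO $pdo, string $title): int
{
    $stmt = $pdo->prepare('INSERT INTO Idea (title) VALUES (?)');
    $stmt->execute([$title]);
    // lastInsertId() returns the ID as a string, so cast it.
    return (int) $pdo->lastInsertId();
}

$firstId = insert_idea($pdo, 'First idea');
$secondId = insert_idea($pdo, 'Second idea');
echo "$firstId $secondId\n";
```

The ID has to be read on the same connection immediately after the insert, before any other insert runs on it.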
Great, I'll add the function to the 'submit_functions' script then.
Committed straight to master again, haha. I promise, I didn't break anything.
No worries, I've been doing the same thing, haha
Oh nicee!! I'm sorry I couldn't be of much help in figuring out how to get the results dynamically :(
No problem. The webcrawler stuff is awesome!
For some reason the code for sending data to crowdflower isn't working anymore. I haven't figured out what the problem is yet.
Was it working on the server before?
I'm running the code that I used for testing before. I didn't change anything. The same code that worked before doesn't work anymore. Not getting any error, seems like it times out when trying to connect to CrowdFlower. :S
Yeah, but were you testing on the server before? The reason I ask is that it might be a port issue with the server. But if it was working on the server before, we know it's not that.
Hmm, I thought I did test it on the server, but maybe not. I just tried running from local server and it works there. So maybe it is a port issue with the server.
Someone on StackOverflow mentioned the problem might be related to SSL, though no solution was given: "This problem is probably specific for a site and a specific SSL backend. Your Windows version uses OpenSSL while the Linux version uses NSS as the backend." http://stackoverflow.com/questions/28358557/curl-works-on-windows-server-but-not-on-linux-is-ssl-to-blame
The curl error I get is: Failed to connect to api.crowdflower.com port 80: Connection timed out
Ok, I just opened up that port on the outbound and it was already open on inbound. Let me know! (I think it's working for me now)
d (^-^) b
\ (^-^) / \ (^-^) / \ (^-^) / \ (^-^) / \ (^-^) /
awesome!
Getting the results from CrowdFlower is working now. This is the sql to get ideas that have been accepted by crowdflower: $sql = "SELECT * FROM Idea WHERE id IN (SELECT id_idea FROM Screening_results_crowdflower WHERE accepted = 1)";
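To see what that query returns, here is a small self-contained sketch that runs it against made-up rows. It uses an in-memory SQLite database through PDO rather than the project's MySQL database, and the table layouts are reduced to the columns the query touches, so the schema and sample data are assumptions.

```php
<?php
// Sketch: run the accepted-ideas query against made-up rows.
// ASSUMPTION: simplified tables in in-memory SQLite, not the real schema.

$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE Idea (id INTEGER PRIMARY KEY, title TEXT)');
$pdo->exec('CREATE TABLE Screening_results_crowdflower (id_idea INTEGER, accepted INTEGER)');
$pdo->exec("INSERT INTO Idea (id, title) VALUES (1, 'A'), (2, 'B'), (3, 'C')");
$pdo->exec('INSERT INTO Screening_results_crowdflower (id_idea, accepted) VALUES (1, 1), (2, 0), (3, 1)');

// The query from the comment above, unchanged.
$sql = "SELECT * FROM Idea WHERE id IN (SELECT id_idea FROM Screening_results_crowdflower WHERE accepted = 1)";

$acceptedIds = [];
foreach ($pdo->query($sql) as $row) {
    $acceptedIds[] = (int) $row['id'];
}
sort($acceptedIds); // the query has no ORDER BY, so sort for a stable result
echo implode(',', $acceptedIds), "\n";
```

Only ideas with a screening row where `accepted = 1` come back; ideas that were rejected or not yet screened are left out.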
What is the full URL for the images? I need it to show the images correctly on CrowdFlower.
In the database the image URL is stored as /var/www/html/uploads/Weaver_bird.jpg. Since /var/www/html is the path to the web directory, the URL for the image is http://54.93.120.201/uploads/Weaver_bird.jpg. Do you want me to implement the conversion to that URL?
In PHP it should be something like:
$image_from_db = "/var/www/html/uploads/Weaver_bird.jpg";
$image_to_crowdflower = "http://54.93.120.201/" . substr($image_from_db, 14);
I haven't tested that but it should be close
The CrowdFlower job uses CML (CrowdFlower Markup Language) similar to HTML. I don't think there's a substring function. I think the best thing would be to save just the end part "Weaver_bird.jpg" in the database. I can then fill in the rest of the url in CrowdFlower.
Can't you just edit the entry in the code before you send it? In this snippet:
$entry = array( "created" => "" , "favorite_count" => "" , "id" => $new_idea_id ,
"image" => $clean_image_location , "lang" => "" , "retweet_count" => "" ,
"text" => "" , "text_description" => $clean_valueText ,
"title" => $clean_title ,"user" => "" );
Just switch $clean_image_location with a modified value? Or am I thinking about this wrong?
Yeah, that would work. Just thought it would be good practice to have cleaner data. I'll change it later.
I think either way could be argued to be poor practice. Having full raw URLs to your own server means the database isn't portable to a new host. I'd say it's bad practice to have raw IPs to yourself encoded anywhere. You make a good point about trying to use good practice though.
How about instead of hardcoding the IP in before sending it to CrowdFlower we dynamically get it? That way the app and DB could theoretically move to a new host. We can use the PHP variable $_SERVER['HTTP_HOST'] for that.
So something like:
$image_from_db = "/var/www/html/uploads/Weaver_bird.jpg";
$image_to_crowdflower = "http://" . $_SERVER['HTTP_HOST'] . "/" . substr($image_from_db, 14);
The full raw URL would only be in one place, the Crowdflower job, and not actually in any data.
But this is good, I'll change it now.
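The path-to-URL conversion discussed above could be wrapped in a small function along these lines. The function name `image_path_to_url`, the default web root, and the `http://` scheme are assumptions for the sketch; in a web request the host argument could come from `$_SERVER['HTTP_HOST']`.

```php
<?php
// Sketch: turn a stored filesystem path into a public URL.
// ASSUMPTION: $webRoot and the http:// scheme reflect the server setup.

function image_path_to_url(string $path, string $host, string $webRoot = '/var/www/html'): string
{
    // Only strip the web root when the path actually starts with it.
    if (strpos($path, $webRoot) === 0) {
        $path = substr($path, strlen($webRoot));
    }
    return 'http://' . $host . '/' . ltrim($path, '/');
}

// The host is passed in here so the example also runs from the command line.
$url = image_path_to_url('/var/www/html/uploads/Weaver_bird.jpg', '54.93.120.201');
echo $url, "\n";
```

Deriving the offset from the web root with `strlen()` avoids the hardcoded 14, so the conversion still works if the upload directory moves.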
@millenniumproof It seems to me that crowdflower is working now, should I close this issue?