Open alexeyza opened 8 years ago
did anyway email Carlos about this -- it is customary lately to see a comment that if the paper is accepted the data will be available online.
I just emailed Carlos about it
Could we also post the survey responses? The paper lacks discussion of the survey analysis and details such as how many people said what. There is a lack of traceability for our findings from the survey.
Carlos responded to my email and said he is working on this.
I'm going to add a footnote in methodology that will say "Our sample data will be made openly available online for camera ready"
Update: I'm rephrasing it into "Our sample data will be openly available online by camera ready"
This way it doesn't sound like we have it already but insist on not publishing it
Although I am all for making the data public and requested it as a note in the paper, I wonder if the ethics form submitted to HREB covered making the survey responses and/or the aggregated data public.
Carlos mentioned the size of the data is 1 gigabyte. I recommended to try zipping it (as it helped with the original SO files).
Daniel suggested hosting it on his server.
or github under the chisel user. The only constraint is the size of uploaded files. One way to go around this is to create a multi-part zip file. I have seen other research groups do this.
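A minimal sketch of the multi-part idea, assuming a plain byte-level split of an already zipped file rather than a true multi-volume zip archive (which needs an external tool). The function names, the `.partNNN` suffix, and the chunk size are illustrative and were not agreed in this thread.

```python
from pathlib import Path


def split_file(path, chunk_size):
    """Split `path` into numbered parts of at most `chunk_size` bytes each."""
    src = Path(path)
    parts = []
    with src.open("rb") as fh:
        index = 0
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # Zero-padded index keeps the parts in order when sorted by name.
            part = src.with_name(f"{src.name}.part{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_parts(parts, target):
    """Concatenate the parts, in sorted name order, back into one file."""
    with Path(target).open("wb") as out:
        for part in sorted(parts):
            out.write(Path(part).read_bytes())
    return Path(target)
```

Anyone downloading the parts would run `join_parts` (or an equivalent `cat`) before unzipping, so the README should say so.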
CHASE 2016 recommends zenodo.org and figshare.com for data preservation. Both of them provide a DOI, which could be cited in the paper instead of relying on a footnote.
That would be great if we can use one of these. @cagomezt or @gpoo can you see if we can use them? I won't have the time to try myself.
Have you tried uploading it to zenodo.org?
https://zenodo.org/record/47455
Best regards, Carlos Gómez
This is great Carlos, thanks for doing it!!
I noticed it's a bin file... does it require something specific to open/view it?
I've added the link in the paper. This should be good for the submission.
I just wonder if it would be clear to the reader how to read/use the data (since it's a bin file). Perhaps we should explain how to read/use it on the zenodo website?
@cagomezt Is it possible to rename the file to something more meaningful for anybody? Don't be afraid of long names ;-)
I noticed that the file is a PostgreSQL dump. Something like R-ML-and-StackOverflow-psql.dump may be clearer, along with a note stating that it was created with PostgreSQL 9.3.11.
We can't :(. According to the website, once the file is uploaded and published, you can't do anything other than change the metadata.
Is it possible to remove it and create a new one?
I noticed that the tables ml_users and ml_mail expose the email addresses of the users, which is not really necessary, as the matches are done with the md5 column.
In the European Union, email addresses are considered personal data, and the site is in Europe, funded by the European Union.
Better safe than sorry.
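A rough sketch of dropping the addresses while keeping the join key. It assumes the md5 column holds the MD5 hex digest of the lowercased address, which this thread does not confirm; the row layout and the field names `email` and `md5` are illustrative.

```python
import hashlib


def email_digest(address):
    """MD5 hex digest of a normalised address (assumed to match the md5 column)."""
    return hashlib.md5(address.strip().lower().encode("utf-8")).hexdigest()


def strip_emails(rows):
    """Return copies of the rows without the `email` field, keeping an `md5` key."""
    cleaned = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != "email"}
        # Only compute a digest when the row does not already carry one.
        if "md5" not in out and "email" in row:
            out["md5"] = email_digest(row["email"])
        cleaned.append(out)
    return cleaned
```

Note that an unsalted MD5 of an address can still be reversed by guessing likely addresses, so this only removes the plain-text exposure.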
No, I can't touch it. Once published I can only update the metadata. However, I can add a condition to download the data.
"Specify the conditions under which you grant users access to the files in your upload. User requesting access will be asked to justify how they fulfil the conditions. Based on the justification, you decide who to grant/deny access. You are not allowed to charge users for granting access to data hosted on Zenodo."
I closed the access to the file until I write a proper condition for the file.
If you can close the access, then you could leave it closed and create a new one without personal data. How does this sound?
If this is not resolved by tomorrow, we can't submit the paper!!!!!
Reviewer #3:
Footnote 3. I would prefer to see the data now than to be promised it later. To me that is part of the review.
Alexey: update - Carlos has uploaded the data to zenodo.org but then he closed the access ... so until it is fixed it is still unresolved.
Daniel: could you check that the data posted is ok? (Carlos once you have posted it please assign this to Daniel)
Sorry, electrical damage kept me off the internet until now. I deleted the personal data from the dump file and now I am uploading it to Zenodo. Moreover, I am adding a clause that protects the users just in case, since the bodies of the emails may contain personal email addresses.
Data is now public: https://zenodo.org/record/47484
The current one requires access (it is not open) so I could not test it. The paper is still using the old one. The old page should be deleted. Once it is made available I can download it and try to use it.
I will once I get access to the data.
--dmg
Daniel M. German "As De Gaulle used to say: 'Aim well, shoot fast Henri Cartier Bresson -> and get the hell out.'" http://turingmachine.org/ http://silvernegative.com/ dmg (at) uvic (dot) ca replace (at) with @ and (dot) with .
I changed the access restriction, so DMG can take a look.
@dmgerman: Carlos has made it open access. Please check the file now: https://zenodo.org/record/47484
I made the needed changes in the paper (to point to the correct URL), I'm waiting to commit it after Cassie has finished her pass on the paper.
I was able to restore the database. It seems to be working... but... three issues.
pg_restore -Fc -C R-ML-and-StackOverflow-psql.bin | psql template1
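For anyone scripting the restore above, a small sketch that wraps the same pipeline in Python. The function names are illustrative; it assumes `pg_restore` and `psql` are on the PATH and that the local user may connect to `template1`.

```python
import subprocess


def restore_commands(dump_path, maintenance_db="template1"):
    """Build the two commands of the pipeline: pg_restore emits SQL, psql runs it."""
    pg_restore = ["pg_restore", "-Fc", "-C", dump_path]
    psql = ["psql", maintenance_db]
    return pg_restore, psql


def restore(dump_path, maintenance_db="template1"):
    """Run `pg_restore ... | psql ...` and return psql's exit code."""
    pg_restore_cmd, psql_cmd = restore_commands(dump_path, maintenance_db)
    producer = subprocess.Popen(pg_restore_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(psql_cmd, stdin=producer.stdout)
    producer.stdout.close()  # let pg_restore see SIGPIPE if psql exits early
    consumer.wait()
    producer.wait()
    return consumer.returncode
```

Because `-C` is given without `-d`, pg_restore writes a script that creates the database itself, which is why psql only needs to connect to an existing maintenance database.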
Considering that zenodo seems to allow only one dataset per upload, (1) and (2) can be addressed in the metadata information.
(3) will require a new upload, and a new DOI.
It would be better to create a zip file that contains everything (README, database dump and questions classification). Replicate the README in the metadata.
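One way the single archive could be assembled, as a sketch only. The file names passed in are placeholders; the actual README and classification files are not named in this thread.

```python
import zipfile
from pathlib import Path


def build_bundle(archive_path, files):
    """Write the given files into one deflate-compressed zip and return its member names."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            f = Path(f)
            zf.write(f, arcname=f.name)  # store flat, without local directory paths
    # Re-open the archive and check every member's CRC before publishing it.
    with zipfile.ZipFile(archive_path) as zf:
        bad = zf.testzip()  # returns the first corrupt member, or None
        if bad is not None:
            raise ValueError(f"corrupt member in archive: {bad}")
        return zf.namelist()
```

Checking the archive before upload matters here because a published Zenodo file cannot be replaced afterwards.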
Can Carlos do the necessary changes? Or can you (Daniel) do them?
I can't touch the file once it is uploaded. I can add all the instructions to the description of the file.
I just updated the Zenodo metadata. @alexeyza can you take a look and tell me if everything is ok?
Adding the description of the database and how to use it to the metadata is good enough. But what about the classification data?
@dmgerman The problem with zenodo is: you can save and update the data as many times as possible, but only before submission. However, without submission none of us would have access to the data, nor the DOI. Only the metadata can be updated.
That said, I looked at the zenodo web site, and you can link data sets through references. That is: @cagomezt could create a new data set that contains only the sample, and link it as part of the other data set already existing. And then, update the description of both data sets to make clear that those are related.
And the other part can be done either today or tomorrow, depending on Carlos's availability. This will not affect the paper or the submission.
Does this make sense?
I think we have two problems that are independent, but the way we are implementing them is making them harder.
1. Create the dataset. Make sure we have all the files in one place, add a README file.
2. Check this dataset to make sure it works.
3. Upload to where it is supposed to go.
Can you send me the data for the classified questions?
I am not at home right now.
See this:
https://github.com/dmgerman/R-ML-and-StackOverflow
This repo can be transferred to Chisel and potentially archived in zenodo, see
https://guides.github.com/activities/citable-code/
At least we will have a URL for the paper to add.
I know it is a little late, but I feel it is better to say this now than later. I was reviewing the ethics approval that I have for this study (14-313), and I found that all the information should be anonymized. However, the bodies of the emails still contain some personal information that I could not delete in the time available. Additionally, users can be identified from the text of their emails. Is it still OK if we publish the information like that?
I don't want to have problems with UVic's ethics committee.
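A sketch of one partial mitigation: masking anything that looks like an email address in the message bodies. The regular expression is a deliberately simple assumption and will not catch names, signatures, or obfuscated addresses, so it does not by itself satisfy an anonymization requirement.

```python
import re

# Simple pattern for address-like tokens; an approximation, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text, placeholder="[email removed]"):
    """Replace every address-like token in `text` with `placeholder`."""
    return EMAIL_RE.sub(placeholder, text)
```

Identification through writing style or message content, the second concern raised above, is not addressed by this kind of masking.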
Sorry, I was away teaching my labs until now.
I talked to Peggy, the paper is ready. I'm planning to give it a quick read and try to submit soon.
Has this been resolved? What link should I use for the online data?
Hi Alexey,
the best solution is to clone my repo to chisel, then I'll delete mine and use the URL of the repo. Then we can solve the issue of the methods classification. Using zenodo is just a pain if we need to update the data.
We should also involve Peggy on this to see what she thinks.
She is out... I don't know if she will reply anytime soon.
Then let us clone the repo into chisel (I can't, because I don't have the rights). And use the github address in the paper. That would be ok. We can then link to a zenodo DOI when we have it all sorted out.
I might be able to clone it into CHISEL, but it will have to be a public repo.
It must be a public repo. So it is ok.
Can we make the data available online (not the survey data, but the archival data selected for the sample)? Since it is publicly available data, it might be possible. Perhaps put a zip file on GitHub/Dropbox or something?
Even if we can't do it right now, we may be able to do it by camera ready, in which case we can add a comment about that in the paper.