cagomezt / MSR2016


Making the data available online #16

Open alexeyza opened 8 years ago

alexeyza commented 8 years ago

Can we make the data available online (not the survey data, but the archival data selected for the sample)? Since it is publicly available data, it might be possible. Perhaps put a zip file on GitHub/Dropbox or something?

Even if we can't do it right now, we may be able to do it by camera ready. Then we can add a comment about that in the paper.

margaretstorey commented 8 years ago

Did anyone email Carlos about this? It is customary lately to see a comment that if the paper is accepted the data will be available online.

alexeyza commented 8 years ago

I just emailed Carlos about it

margaretstorey commented 8 years ago

Could we also post the survey responses? The paper lacks discussion of the survey analysis and things like how many people said what. There is a lack of traceability for our findings from the survey...

alexeyza commented 8 years ago

Carlos responded to my email and said he is working on this.

I'm going to add a footnote in methodology that will say "Our sample data will be made openly available online for camera ready"

alexeyza commented 8 years ago

Update: I'm rephrasing it into "Our sample data will be openly available online by camera ready"

This way it doesn't sound like we have it already but insist on not publishing it

gpoo commented 8 years ago

Although I am all for making the data public and requested it as a note in the paper, I wonder if the ethics form submitted to HREB included making the survey responses and/or the aggregated data public.

alexeyza commented 8 years ago

Carlos mentioned the size of the data is 1 gigabyte. I recommended trying to zip it (as it helped with the original SO files).

Daniel suggested hosting it on his server.

dmgerman commented 8 years ago

Or GitHub under the chisel user. The only constraint is the size of uploaded files. One way to get around this is to create a multi-part zip file. I have seen other research groups do this.
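As a rough illustration of working around a per-file size limit, the sketch below cuts one large file into fixed-size pieces and joins them again. Note that this is plain byte splitting, not the split-archive feature of zip tools that the comment refers to; the function names and the `.partNNN` suffix are hypothetical choices made for this example.

```python
from __future__ import annotations

from pathlib import Path


def split_file(src: Path, part_size: int) -> list[Path]:
    """Write src as src.part000, src.part001, ... of at most part_size bytes each."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    parts = []
    with src.open("rb") as fh:
        index = 0
        while True:
            chunk = fh.read(part_size)  # one part is held in memory at a time
            if not chunk:
                break
            part = src.with_name(f"{src.name}.part{index:03d}")
            part.write_bytes(chunk)
            parts.append(part)
            index += 1
    return parts


def join_parts(parts: list[Path], dest: Path) -> None:
    """Concatenate the parts, in name order, back into a single file."""
    with dest.open("wb") as out:
        for part in sorted(parts):
            out.write(part.read_bytes())
```

Anyone downloading the pieces would need all of them plus a note on how to reassemble them, which is one more reason to ship a README alongside the data.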

gpoo commented 8 years ago

CHASE 2016 recommends zenodo.org and figshare.com for data preservation. Both of them provide a DOI, which could be cited in the paper instead of relying on a footnote.

alexeyza commented 8 years ago

That would be great if we can use one of these. @cagomezt or @gpoo can you see if we can use them? I won't have the time to try myself.

alexeyza commented 8 years ago

Have you tried uploading it to zenodo.org?

cagomezt commented 8 years ago

https://zenodo.org/record/47455

On 12 March 2016 at 16:10, Alexey Zagalsky notifications@github.com wrote:

Have you tried uploading it to zenodo.org?


Best regards, Carlos Gómez

alexeyza commented 8 years ago

This is great Carlos, thanks for doing it!!

alexeyza commented 8 years ago

I noticed it's a bin file... does it require something specific to open/view it?

alexeyza commented 8 years ago

I've added the link in the paper. This should be good for the submission.

I just wonder whether it would be clear to the reader how to read/use the data (since it's a bin file). Perhaps we should explain how to read/use it on the Zenodo website?

gpoo commented 8 years ago

@cagomezt Is it possible to rename the file to something more meaningful for anybody? Don't be afraid of long names ;-)

I noticed that the file is a PostgreSQL dump. Something like R-ML-and-StackOverflow-psql.dump may be clearer, along with a note stating that it was created with PostgreSQL 9.3.11.

cagomezt commented 8 years ago

We can't :(. According to the website, once the file is uploaded and published, you can't do anything other than change the metadata.

gpoo commented 8 years ago

Is it possible to remove it and create a new one?

I noticed that the tables ml_users and ml_mail expose the email addresses of the users, which is not really necessary, as the matches are done with the md5 column.

In the European Union an email address is considered personal data, and the site is in Europe, funded by the European Union.

Better safe than sorry.
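To make the point about the md5 column concrete, here is a minimal sketch of dropping the address while keeping a hash to join on. It assumes the hash is the MD5 of the trimmed, lowercased address and that the column is called `email`; neither detail is confirmed anywhere in this thread, and both function names are hypothetical.

```python
import hashlib


def email_key(address: str) -> str:
    """Return the MD5 hex digest of a normalized address (assumed scheme)."""
    normalized = address.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def drop_email(row: dict) -> dict:
    """Return a copy of the row without the raw address, keyed by its hash."""
    cleaned = {k: v for k, v in row.items() if k != "email"}
    # Keep an existing md5 value if the row already has one.
    cleaned.setdefault("md5", email_key(row["email"]))
    return cleaned
```

Rows from ml_users and ml_mail can still be matched on `md5` after this step. An unsalted hash of an address can be recovered by hashing candidate addresses, so this reduces exposure rather than anonymizing the data.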

cagomezt commented 8 years ago

No, I can't touch it. Once published I can only update the metadata. However, I can add a condition to download the data.

"Specify the conditions under which you grant users access to the files in your upload. User requesting access will be asked to justify how they fulfil the conditions. Based on the justification, you decide who to grant/deny access. You are not allowed to charge users for granting access to data hosted on Zenodo."

cagomezt commented 8 years ago

I closed the access to the file until I write a proper condition for the file.

gpoo commented 8 years ago

If you can close the access, then you could leave it closed and create a new one without personal data. How does this sound?

alexeyza commented 8 years ago

If this is not resolved by tomorrow, we can't submit the paper!!!!!

margaretstorey commented 8 years ago

Reviewer #3:

Footnote 3. I would prefer to see the data now than to be promised it later. To me that is part of the review.

Alexey: update - Carlos has uploaded the data to zenodo.org but then he closed the access ... so until it is fixed it is still unresolved.

Daniel: could you check that the data posted is ok? (Carlos once you have posted it please assign this to Daniel)

cagomezt commented 8 years ago

Sorry, an electrical problem kept me off the internet until now. I deleted the personal data from the dump file and now I am uploading it to Zenodo. Moreover, I am adding a clause that protects the users just in case. The email bodies may contain personal email addresses.

cagomezt commented 8 years ago

Data is now public: https://zenodo.org/record/47484

dmgerman commented 8 years ago

The current one requires access (it is not open), so I could not test it. The paper is still using the old one. The old page should be deleted. Once it is made available I can download it and try to use it.

dmgerman commented 8 years ago

Daniel: could you check that the data posted is ok? (Carlos once you have posted it please assign this to Daniel)

I will once I get access to the data.


cagomezt commented 8 years ago

I changed the access restriction, so DMG can take a look.

alexeyza commented 8 years ago

@dmgerman: Carlos has made it open access. Please check the file now: https://zenodo.org/record/47484

alexeyza commented 8 years ago

I made the needed changes in the paper (to point to the correct URL). I'm waiting to commit them until Cassie has finished her pass on the paper.

dmgerman commented 8 years ago

I was able to restore the database. It seems to be working... but... three issues.

  1. This file needs a readme that says what command to run to recreate the database. I had to read the manuals: for example, this command would create a database MLandSOF

pg_restore -Fc -C R-ML-and-StackOverflow-psql.bin | psql template1

  2. What does this database contain? It needs to be explained.
  3. Where are the 400 questions analyzed from each set? They do not seem to be there.


gpoo commented 8 years ago

Considering that zenodo seems to allow only one dataset per upload, (1) and (2) can be addressed in the metadata information.

(3) will require a new upload, and a new DOI.

dmgerman commented 8 years ago

It would be better to create a zip file that contains everything (README, database dump and questions classification). Replicate the README in the metadata.
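A bundle like the one proposed here could be assembled along the following lines. This is only a sketch: `build_bundle` and the file names in the usage note are hypothetical, since the thread does not say what the README or the classification file are called.

```python
from __future__ import annotations

import zipfile
from pathlib import Path


def build_bundle(files: list[Path], bundle: Path) -> list[str]:
    """Pack the given files into one zip and return the archive listing."""
    missing = [str(f) for f in files if not f.is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {', '.join(missing)}")
    with zipfile.ZipFile(bundle, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Store each file at the top level of the archive.
            zf.write(f, arcname=f.name)
    with zipfile.ZipFile(bundle) as zf:
        return zf.namelist()
```

For example, `build_bundle([Path("README.txt"), Path("R-ML-and-StackOverflow-psql.bin"), Path("classification.csv")], Path("dataset.zip"))` would produce a single upload, with the README and classification names being placeholders. Checking the returned listing before uploading matters more than usual here, because the file cannot be changed once it is published.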


alexeyza commented 8 years ago

Can Carlos do the necessary changes? Or can you (Daniel) do them?

cagomezt commented 8 years ago

I can't touch the file once it is uploaded. I can add all the instructions to the description of the file.

cagomezt commented 8 years ago

I just updated the Zenodo metadata. @alexeyza can you take a look and tell me if everything is OK?

dmgerman commented 8 years ago

Adding the description of the database and how to use it to the metadata is good enough. But what about the classification data?


gpoo commented 8 years ago

@dmgerman The problem with Zenodo is: you can save and update the data as many times as you like, but only before submission. However, without submission none of us would have access to the data, nor the DOI. After submission, only the metadata can be updated.

That said, I looked at the Zenodo web site, and you can link data sets through references. That is: @cagomezt could create a new data set that contains only the sample, and link it to the existing data set. And then update the description of both data sets to make clear that they are related.

The other part can be done either today or tomorrow, depending on Carlos's availability. This will not affect the paper or the submission.

Does this make sense?

dmgerman commented 8 years ago

I think we have two problems that are independent, but the way we are handling them is making them harder.

1. Create the dataset. Make sure we have all the files in one place, add a README file.

2. Check this dataset to make sure it works.

3. Upload to where it is supposed to go.


dmgerman commented 8 years ago

Can you send me the data for the classified questions?


cagomezt commented 8 years ago

I am not at home right now.

dmgerman commented 8 years ago

See this:

https://github.com/dmgerman/R-ML-and-StackOverflow

This repo can be transferred to Chisel and potentially archived in zenodo, see

https://guides.github.com/activities/citable-code/

At least we will have a URL for the paper to add.


cagomezt commented 8 years ago

I know that it is a little bit late, but I feel it is better to say this now than later. I was reviewing the ethics approval that I have for this study, 14-313, and I found that all the information should be anonymized. However, the bodies of the emails still contain some personal information that I could not delete in the time available. Additionally, users can be identified from the text of their emails. Is it still OK if we publish the information like that?

I don't want to have problems with UVic's ethics committee.
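One partial mitigation for the concern above is to mask anything that looks like an email address inside the message text. The sketch below uses a deliberately simple regular expression and a hypothetical `redact_emails` helper; it does not catch names, signatures, or writing style, so it would not by itself make the bodies anonymous.

```python
import re

# Simple pattern for typical addresses; not a full RFC 5322 matcher.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, placeholder: str = "[email removed]") -> str:
    """Replace every address-like substring with a fixed placeholder."""
    return EMAIL_PATTERN.sub(placeholder, text)
```

Running this over the email body column before creating the dump would remove the most direct identifiers, but whether the result satisfies the ethics approval is a question for the committee, not for the script.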

alexeyza commented 8 years ago

Sorry, I was away teaching my labs until now.

I talked to Peggy, the paper is ready. I'm planning to give it a quick read and try to submit soon.

Has this been resolved? What link should I use for the online data?

dmgerman commented 8 years ago

Hi Alexey,

The best solution is to clone my repo to chisel; then I'll delete mine and we can use the URL of the chisel repo. Then we can solve the issue of the methods classification. Using Zenodo is just a pain if we need to update the data.

We should also involve Peggy on this to see what she thinks.


alexeyza commented 8 years ago

She is out... I don't know if she will reply anytime soon.

dmgerman commented 8 years ago

Then let us clone the repo into chisel (I can't, because I don't have the rights). And use the github address in the paper. That would be ok. We can then link to a zenodo DOI when we have it all sorted out.


alexeyza commented 8 years ago

I might be able to clone it into CHISEL, but it will have to be a public repo.


cagomezt commented 8 years ago

I know that it is a little bit late, but I feel it is better to say this now than later. I was reviewing the ethics approval that I have for this study, 14-313, and I found that all the information should be anonymized. However, the bodies of the emails still contain some personal information that I could not delete in the time available. Additionally, users can be identified from the text of their emails. Is it still OK if we publish the information like that?

I don't want to have problems with UVic's ethics committee.

dmgerman commented 8 years ago

It must be a public repo. So it is ok.
