Coherency between the folder structure in local and remote execution

DEIB-GECO / PyGMQL

Python Library for data analysis based on GMQL

Apache License 2.0

Coherency between the folder structure in local and remote execution #13

Closed MicheleRoar closed 6 years ago

MicheleRoar commented 6 years ago

------------ Comment added by Luca for clarity (PLEASE ADD BETTER DESCRIPTION NEXT TIME) ------------

When a query is performed locally with a statement like the following

result = dataset.materialize("/path/to/result/")

the results are stored in /path/to/result/ with the following structure:

/path/to/result/
    exp/
        S_00000.gdm
        S_00000.gdm.meta
        ...

whereas when downloading the results of a query run through the web interface, the structure is:

/path/to/result/
    info.txt
    query.txt
    vocabulary.txt
    files/
        S_00000.gdm
        S_00000.gdm.meta
        ...

and finally, when downloading the results of a remote query using the library the structure is the following:

/path/to/result/
    S_00000.gdm
    S_00000.gdm.meta
    ...

All the ways of downloading or generating datasets should produce the same folder structure.
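
To make the three layouts concrete, here is a minimal sketch of how a script could tell them apart. It is only an illustration: the function name `detect_layout` and the labels it returns are hypothetical and are not part of PyGMQL; the only things taken from the description above are the `exp/` and `files/` subdirectory names and the `.gdm` sample extension.

```python
from pathlib import Path


def detect_layout(result_dir):
    """Classify a result folder as one of the three layouts described above.

    The labels "local", "web_zip" and "remote_flat" are hypothetical names
    chosen for this sketch; they are not PyGMQL identifiers.
    """
    root = Path(result_dir)
    if (root / "exp").is_dir():
        return "local"        # samples stored under exp/
    if (root / "files").is_dir():
        return "web_zip"      # samples stored under files/
    if any(root.glob("*.gdm")):
        return "remote_flat"  # samples stored directly in the result folder
    return "unknown"
```

A caller that has to read results from any of the three sources could branch on the returned label until the layouts are unified.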

lucananni93 commented 6 years ago

I totally agree that the resulting dataset should have the same structure in all cases. However, I would like to discuss a bit more which structure is the best one.

I would propose to use the

/path/to/result/
    S_00000.gdm
    S_00000.gdm.meta
    ...

version because it seems the clearest one, but I would also like to hear the opinions of @marcomass and @Sim1Pall8a, because this modification may also affect the R API.

Please tell us your opinion.

marcomass commented 6 years ago

This issue was also addressed in GMQL issue #87 https://github.com/DEIB-GECO/GMQL/issues/87 . My opinion is the following: to standardize the three modalities described by Michele, we need to consider the requirements of each of them. 1) Let's start from "downloading the results of a query done using the web interface". In this case the structure was defined by Arif (whom I include here, @acanakoglu), since it is required to download all included files and to separate the sample and schema files from the others. Case 2), storing from the API, generates a very similar structure, the only difference being that the subdirectory is named exp/ instead of files/. Case 3), PyGMQL, does not include the subdirectory (since it only provides sample files).

If we adopt structure 3) for all cases, in cases 1) and 2) we would mix sample files with the other files, which I think is much better to avoid (as Arif decided). So I would adopt 1) or 2), and since we already have several datasets with structure 1), I would adopt that one, just changing the subdirectory name in the API (see GMQL issue #87 https://github.com/DEIB-GECO/GMQL/issues/87 ).

If we agree on this, who can make the API change? And, together with it, harmonize the schema file name (as also indicated in GMQL issue #87 https://github.com/DEIB-GECO/GMQL/issues/87 ), i.e. close GMQL issue #87?

acanakoglu commented 6 years ago

I discussed this with Luca, and we decided as follows. I took the zip structure as the base structure, and we will correct the others with respect to it.

Case 1 (Creation of DS in Python or R interfaces): we will rename exp to files (which will be coherent with the zip file) and it will work correctly.
Case 2 (Download from web interface as zip file): I will not change the zipping procedure; it will continue to create the structure as it does now.
Case 3 (Download sample by sample from web interface): the PyGMQL and RGMQL interfaces will copy the files into a subdirectory (./files/).

If anything is not clear, please let me know.

Case 1 can be done by @andreagulino or by me. Case 3 should be done by @lucananni93 and @Sim1Pall8a.
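
As a rough illustration of what the Case 1 and Case 3 changes amount to on disk, the sketch below converts an existing result folder to the agreed files/ layout. It is not the actual PyGMQL or GMQL implementation: the function name `normalize_to_files_layout` is hypothetical, and it only handles the `.gdm` and `.gdm.meta` sample files shown in this thread. Schema files are left untouched, because their names are not specified here.

```python
import shutil
from pathlib import Path


def normalize_to_files_layout(result_dir):
    """Bring a result folder to the agreed layout with a files/ subdirectory.

    Hypothetical helper for illustration only; not part of PyGMQL.
    """
    root = Path(result_dir)
    files_dir = root / "files"
    exp_dir = root / "exp"

    # Case 1: a locally materialized dataset keeps its samples in exp/,
    # so renaming the subdirectory is enough.
    if exp_dir.is_dir() and not files_dir.exists():
        exp_dir.rename(files_dir)
        return files_dir

    # Case 3: samples downloaded one by one sit directly in the result
    # folder, so they are moved into files/.
    files_dir.mkdir(exist_ok=True)
    for entry in list(root.iterdir()):
        if entry.is_file() and entry.name.endswith((".gdm", ".gdm.meta")):
            shutil.move(str(entry), str(files_dir / entry.name))
    return files_dir
```

Calling the helper on a folder that already has a files/ subdirectory and no loose sample files leaves it unchanged, so it can be run safely on zip downloads (Case 2) as well.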

lucananni93 commented 6 years ago

@acanakoglu What is the status of this issue?