Dataset duplicated - GitHub Issues

Bontempogianpaolo1 commented 2 years ago

As you can see from the picture below, each job contains a huge amount of data. The reason is that it contains duplicates.

federicacitarrella commented 2 years ago

The EricScript database is 8.3 GB. It is saved in the work directory, which is just a temporary directory that stores each file used in the pipeline once (in this case the EricScript database and other smaller files needed for the execution). The EricScript database is also copied to the results directory so that subsequent pipeline executions can use it directly. The work directory can be deleted after the pipeline finishes. The results directory contains all the output files, plus the databases and files needed to run the pipeline again without the downloader processes. It can be deleted too, but then the next pipeline execution will run the downloader processes again. Why do you think there are duplicates?

Bontempogianpaolo1 commented 2 years ago

You are right. The work directory can be deleted after the pipeline execution, but a normal user wouldn't do that.

A normal user could also remove the results folder and not the work folder. In that case, the same dataset would be stored N times inside the work folder. I think pull request #36 resolves this problem easily and simplifies the initial code too.
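To check whether a folder really holds several copies of the same dataset, one option is to group its files by size and content hash. The following is only a minimal sketch and is not part of FusionFlow; the function names `file_digest`, `find_duplicates` and `wasted_bytes` are hypothetical, and the path you pass in (for example `work`) is an assumption about where the pipeline was launched.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map (size, digest) to the sorted list of paths sharing that content.

    Only groups with more than one file are returned. Symlinks are skipped
    because they do not hold a second copy of the data. Hard links are not
    detected and would be reported as duplicates.
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    duplicates = {}
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot be a duplicate, so skip hashing
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        for digest, same in by_digest.items():
            if len(same) > 1:
                duplicates[(size, digest)] = sorted(same)
    return duplicates


def wasted_bytes(duplicates):
    """Bytes that would be freed by keeping one copy per duplicate group."""
    return sum(size * (len(paths) - 1) for (size, _), paths in duplicates.items())
```

Calling `find_duplicates("work")` and passing the result to `wasted_bytes` gives a rough figure for how much space repeated copies take. Files are hashed only when another file has the same size, so a large database that exists once is never read.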

federicacitarrella / FusionFlow

Dataset duplicated #29