getzlab / dalmatian

dalmatian is a collection of high-level companion functions for Firecloud and FISS.
18 stars 13 forks

uploading samples taking 8 cores for 30 min #30

Closed jkobject closed 4 years ago

jkobject commented 4 years ago

dm v0.0.17:

Reuploading samples after a quick change triggers a parallel upload across all of my CPUs.

It takes around 30 min to complete, and the dataframe is not that big: 1500x200.

Is there any way to make this process as fast as it used to be?

Best,

agraubert commented 4 years ago

The slowdown is likely due to Hound provenance records being populated. Since you have observed all cores being used, it is likely that Hound is switching to parallel background updates, which only block if you try to exit Python before the records finish being written.

If you do not need Hound records (logs of each individual change to workspace entities), you can call WorkspaceManager.disable_hound(), which will skip record population. Assuming that Hound is responsible for the slowdown, this should restore the previous performance.
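For reference, a minimal sketch of what that would look like. The workspace name, sample dataframe, and the `upload_samples` call are illustrative assumptions; only `disable_hound()` is taken directly from the comment above.

```python
# Sketch: skip Hound provenance records when reuploading samples.
# Assumes the dalmatian WorkspaceManager API; workspace name and
# dataframe below are hypothetical placeholders.
import dalmatian

wm = dalmatian.WorkspaceManager("my-namespace/my-workspace")  # hypothetical
wm.disable_hound()  # skip per-entity provenance record population

# sample_df would be your 1500x200 samples dataframe
wm.upload_samples(sample_df)
```

`disable_hound()` returns the WorkspaceManager, so it can also be chained inline if you prefer a one-liner before the upload.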

If you do wish to keep the provenance records, then there is no way around the slowdown. However, the batch uploads are highly efficient and run in the background, allowing your script to accomplish other tasks while the records are uploaded.