[SYNPY-1356] Updating syncToSynapse file dependency saving logic

BryanFauble commented 5 months ago

Draft items left to complete:

[x] Cleaning up code
[x] Adding in unit tests
[x] Adding in integration tests

Problem:

When a file's executed/used references point to local files, those files must be saved to Synapse first so that each one has a Synapse ID. A referenced file may not actually need to be uploaded, but it still has to go through the save process to determine whether an upload is needed and to resolve the ID to use.
The existing logic for building this dependency graph was complex and hard to follow.

Solution:

Create an AsyncIO task for every file that needs to be saved. When a file is saved, it first checks whether there are any tasks it depends on. If there are, it uses asyncio.wait to wait for the IDs of those files to be resolved. Once those IDs are available, it executes the save.
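
A minimal sketch of this waiting pattern, not the actual code in synapseutils/sync.py. The names `save_file`, `sync_all`, the example file names, and the fake `syn_fake_...` IDs are all hypothetical stand-ins, and the sketch assumes the dependency graph has no cycles:

```python
import asyncio


async def save_file(path, dependency_tasks, saved):
    """Hypothetical stand-in for saving one file; returns a fake Synapse ID."""
    if dependency_tasks:
        # Wait until every file this one uses/executes has finished saving.
        await asyncio.wait(dependency_tasks)
    # Each finished dependency task carries the ID this file needs to reference.
    used_ids = [task.result() for task in dependency_tasks]
    await asyncio.sleep(0)  # placeholder for the real save/upload call
    saved.append((path, used_ids))
    return f"syn_fake_{path}"


async def sync_all(depends_on):
    """Create one task per file and let each task wait on its dependencies."""
    tasks = {}
    saved = []

    def task_for(path):
        # Create dependency tasks first so they can be handed to the dependent.
        if path not in tasks:
            deps = [task_for(dep) for dep in depends_on.get(path, ())]
            tasks[path] = asyncio.create_task(save_file(path, deps, saved))
        return tasks[path]

    for path in depends_on:
        task_for(path)
    await asyncio.gather(*tasks.values())
    return saved


# Made-up example: report.txt uses data.txt and script.py; data.txt uses script.py.
EXAMPLE = {
    "report.txt": ["data.txt", "script.py"],
    "data.txt": ["script.py"],
    "script.py": [],
}
SAVED = asyncio.run(sync_all(EXAMPLE))
print([path for path, _ in SAVED])
```

All tasks are started up front; ordering comes only from each task awaiting the tasks it depends on, so independent files are free to save concurrently.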

Testing:

I tested with the upload benchmarking script, and did light manual testing with the following manifest TSV:

path    parent  used    executed
/home/bfauble/my_synapse_project/my_file_with_random_data_6.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_2.txt;/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt;/home/bfauble/my_synapse_project/my_folder_4/my_file_with_random_data_44d8d920-4177-46f5-a242-825a61bb7306.txt;/home/bfauble/my_synapse_project/my_folder_8/my_file_with_random_data_38419077-fe04-47fb-b357-b1f8cc92c011.txt;/home/bfauble/my_synapse_project/my_folder_7/my_file_with_random_data_f8f21b81-3f42-4781-a3d9-66135fed7ccb.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_2.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_testing.txt   syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt syn53144254
/home/bfauble/my_synapse_project/my_file_with_random_data_8.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_9.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_5.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_3.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_1.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_file_with_random_data_7.txt syn53144254 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_4/my_file_with_random_data_44d8d920-4177-46f5-a242-825a61bb7306.txt  syn58654150 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_8/my_file_with_random_data_38419077-fe04-47fb-b357-b1f8cc92c011.txt  syn58654158 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_2/my_file_with_random_data_3863e158-9dcd-46db-bfda-a9f6e036d4e4.txt  syn58654145 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_3/my_file_with_random_data_cc65d2b6-8d42-40b5-9c60-135e127574b7.txt  syn58654148 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_9/my_file_with_random_data_d34836cc-5623-4271-a4a5-5937eb5bcc8e.txt  syn58654161 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_1/my_file_with_random_data_ad8a85f9-63a0-4b6e-835a-8dcc3cd7d227.txt  syn53144356 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_1/my_file_with_random_data_a35b6d49-c1cd-4a74-b5f6-af01641f2706.txt  syn53144356 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_1/my_file_with_random_data_b954e345-ca0e-458a-a1ce-eaf0f0d5e333.txt  syn53144356 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_1/my_file_with_random_data_df77acf1-3456-42c7-828f-aa1af73d9b35.txt  syn53144356 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_6/my_file_with_random_data_184ad74e-b5de-4671-894c-41672a5f7753.txt  syn58654154 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_5/my_file_with_random_data_f4e3f16b-2448-4bcc-a3e6-8b8809e820ae.txt  syn58654152 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"
/home/bfauble/my_synapse_project/my_folder_7/my_file_with_random_data_f8f21b81-3f42-4781-a3d9-66135fed7ccb.txt  syn58654156 "/home/bfauble/my_synapse_project/my_file_with_random_data_4.txt"

Assume the following file/dependency diagram. In this diagram we have to start at the nodes with no outbound edges and follow the graph backwards until all files are finally saved. The use of AsyncIO allows us to use futures/awaitables to handle this waiting logic.

flowchart  TD
    a-->b
    b-->c
    c-->d
    a-->e
    e-->f
    f-->d
    f-->h
    i-->e

Verified that the upload process progress bar looks good.

pep8speaks commented 5 months ago

Hello @BryanFauble! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file synapseutils/monitor.py:

Line 149:89: E501 line too long (89 > 88 characters)

In the file synapseutils/sync.py:

Line 710:89: E501 line too long (89 > 88 characters)
Line 1193:89: E501 line too long (91 > 88 characters)

In the file tests/integration/synapseutils/test_synapseutils_sync.py:

Line 206:89: E501 line too long (117 > 88 characters)

Comment last updated at 2024-05-10 01:16:54 UTC

BryanFauble commented 5 months ago

Just wanted to say that this overall logic looks great. If you happen to get a chain of dependencies like this:
flowchart TD
    A --> B
    B --> C
    C --> D
    D --> E
Then it will upload those entities serially. So that's just another thing to think about when people upload with syncToSynapse: even though it is async, it can run into "sync"-like situations. <- Thinking out loud here - is that correct?

@thomasyu888 Yes, the chain of dependencies needing to be stored sequentially is correct. For really large dependency graphs, we could make a small enhancement when the referenced file's metadata already exists in Synapse, since the only thing the next node in the graph needs to know is the Synapse ID of the file at the other end of its outbound edge. However, the code isn't set up to "return early" to handle this situation. It might be premature to optimize for this, but it is something to think about if there are real-world scenarios where performance would be adversely affected.

Sage-Bionetworks / synapsePythonClient

[SYNPY-1356] Updating syncToSynapse file dependency saving logic #1089
