INT-NIT / BEP032tools

Software tools supporting the BIDS Extension Proposal (BEP) dedicated to adding support for electrophysiological data recorded in animal models (BEP032)
MIT License

Testing how the algorithm scales #14

Closed SylvainTakerkart closed 4 years ago

SylvainTakerkart commented 4 years ago

Hi,

Since the aim will be to work with very large directories, I think it would be good to test how the current version of the AnDOChecker works with large datasets.

We should create a dataset with 10 subjects and 500 sessions per subject and see how long it takes to run. Of course, we need a script to create such a large dataset!

Also, I don't think this dataset should be included in this repository... Let's think later about how to do this properly!
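
Something like this minimal sketch could do it; the folder names below are placeholders just to exercise the checker, not the actual BEP032/AnDO naming rules:

import os

def make_dummy_dataset(root, n_subjects=10, n_sessions=500):
    # Build an empty directory tree with the requested number of
    # subjects and sessions; folder names here are placeholders,
    # not the real specification.
    for sub in range(n_subjects):
        for ses in range(n_sessions):
            path = os.path.join(root, f"sub-{sub:03d}", f"ses-{ses:04d}")
            os.makedirs(path, exist_ok=True)

make_dummy_dataset("dummy_dataset", n_subjects=10, n_sessions=500)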

Slowblitz commented 4 years ago

Hi, good idea! I ran the test and put together a small recap; check out https://github.com/Slowblitz/AnDOChecker/blob/master/examples/README.md for further information.

SylvainTakerkart commented 4 years ago

Hi!

Good idea to make it an example ;) So it seems to scale linearly with the number of subjects... (no surprise, in fact, but it's good to see). Do you have any idea why the slope of the curve changes at 200?

Also: it might be more realistic to have a few subjects and a large number of sessions...

Overall, I think that 0.5 s for 500 subjects is not too bad, but it could be better... Indeed, we don't have any checks on file names for now, but they will probably come, which will multiply the number of checks... Can you think about ways to improve the execution times?
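
As a starting point, a rough way to measure this could look like the sketch below; run_checker is just a placeholder for the actual AnDOChecker entry point, whose exact API may differ:

import time

def time_checker(run_checker, dataset_path, n_runs=5):
    # Time the checker a few times on the same dataset and keep the
    # best run; run_checker stands in for the real entry point.
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_checker(dataset_path)
        timings.append(time.perf_counter() - start)
    return min(timings)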

Slowblitz commented 4 years ago

Hi! First, I'll change the code to use a few subjects and a large number of sessions. I have no idea why the curve changes at 200. To improve the execution times, we could try to improve this function:

def parse_all_path(nested_list_of_dir):
    """
    Transform this 
    [
        ['Landing', 'sub-anye', '180116_001_m_anye_land-001', 'source'],
        ['Landing', 'sub-enya', '180116_001_m_enya_land-001', 'source'],
        ['Landing', 'sub-enyo'], 
        ['Landing', 'sub-enyo', '180116_001_m_enyo_land-001']
    ]
    to 
    [
        ['Landing', 'sub-anye', '180116_001_m_anye_land-001', 'source'],
        ['Landing', 'sub-enya', '180116_001_m_enya_land-001', 'source'],
    ]
    Checking for the longest chain with the same sub chain
    """

    main_list = sorted(nested_list_of_dir, key=lambda sublist: len(sublist))

    # TODO: optimize
    # Compare every chain i with every longer chain j; if chain i is
    # fully contained in chain j, merge j's extra elements into i and
    # drop j, so only the longest version of each chain is kept.
    i = 0
    j = 1
    while i < len(main_list) - 1:
        if j <= len(main_list) - 1:
            if len(main_list[i]) <= len(main_list[j]):
                all_in = True
                for elmt in main_list[i]:
                    if elmt not in main_list[j]:
                        all_in = False

                if all_in:
                    for elmt_to_add in main_list[j]:
                        if elmt_to_add not in main_list[i]:
                            main_list[i].append(elmt_to_add)
                    main_list.pop(j)

                else:
                    if j < len(main_list) - 1:
                        j += 1
                    else:
                        i += 1
                        j = i + 1
            else:
                i += 1
                j = i + 1
        else:
            break
    return main_list
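
One possible direction, just as a sketch and not the current implementation: drop the nested list scans and use frozensets, keeping only the chains that are not strictly contained in another chain (this also removes exact duplicates up front):

def parse_all_path_fast(nested_list_of_dir):
    # Sketch of a possible alternative, not the current AnDOChecker code:
    # keep only the chains that are not strictly contained in another
    # chain, using frozensets so each containment test stays cheap.
    unique_chains = []
    seen = set()
    for chain in nested_list_of_dir:
        key = tuple(chain)
        if key not in seen:  # drop exact duplicates up front
            seen.add(key)
            unique_chains.append(chain)

    as_sets = [frozenset(chain) for chain in unique_chains]
    kept = []
    for i, chain in enumerate(unique_chains):
        # keep chain i only if no other chain strictly contains it
        if not any(as_sets[i] < other for j, other in enumerate(as_sets) if j != i):
            kept.append(chain)
    return kept

It is still quadratic in the number of chains, but the per-comparison cost drops, so it would be worth profiling against the current version before deciding.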

We could also try to multi-thread the checker function (see the attached diagram) by dividing the number of folders to check by the number of threads that we want.
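
Roughly along these lines; check_one_folder is just a placeholder for the real per-folder validation, not the actual checker API:

from concurrent.futures import ThreadPoolExecutor

def check_folders_in_parallel(folders, check_one_folder, n_threads=4):
    # Split the per-folder checks across a pool of threads and collect
    # the results; check_one_folder stands in for the real validation.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(check_one_folder, folders))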

SylvainTakerkart commented 4 years ago

Closing this one, the performance seems to be acceptable overall (the AnDOChecker web interface returns its output in ~10 s on a 160 GB dataset with lots of files)... We'll see in the future if we need to be even faster!