Issue when several processes use mbed cache in parallel

adamelhakham commented 6 years ago

Hi, we have a Jenkins job (on an Ubuntu 14.04 machine) that runs mbed deploy multiple times in parallel in different directories. To speed things up, we would like to use the cache feature. However, some of the processes occasionally run into the following error:

[mbed] ERROR: Unknown Error: [('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects/pack/pack-d51b52bd9a7c249c814d6dc49074452fe73c8e22.pack', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/objects/pack/pack-d51b52bd9a7c249c814d6dc49074452fe73c8e22.pack', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects/pack/pack-d51b52bd9a7c249c814d6dc49074452fe73c8e22.pack'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects/pack', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/objects/pack', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects/pack'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects/info', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/objects/info', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects/info'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/objects', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/objects'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/branches', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/branches', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/branches'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/refs', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/refs', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/refs'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/info', 
'/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/info', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/info'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/HEAD', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/HEAD', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/HEAD'"), ('/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/ORIG_HEAD', '/sharedhome/adaelh01/work/parallel_cache_2/dir_8/mbed-os/.git/ORIG_HEAD', "[Errno 2] No such file or directory: '/home/adaelh01/.mbed/cache-2/mbed-cache/github.com/ARMmbed/mbed-os/.git/ORIG_HEAD'")]
---

Currently, to avoid this issue, we turn off the cache feature with mbed cache off. Is the cache feature supposed to support parallel usage?

For convenience, I've added a simple shell script that reproduces the issue (it calls mbed deploy in a loop from 8 different directories in parallel). It usually takes 10-20 minutes for one of the subprocesses to produce the error:

#!/bin/bash
set -e

function deploy_libs {
    while true; do
        cd "$1"
        rm -rf mbed-os
        if [ ! -f .mbed ]; then
            mbed new .
        fi

        mbed deploy -vvv
        cd ../
    done
}

# Create the dirs, each with an mbed-os.lib file
for idx in 1 2 3 4 5 6 7 8
do
    mkdir -p "dir_$idx"
    echo "git@github.com:ARMmbed/mbed-os.git#f9ee4e849f8cbd64f1ec5fdd4ad256585a208360" > "dir_$idx/mbed-os.lib"
done

# Deploy in parallel
for idx in 1 2 3 4 5 6 7 8
do
    deploy_libs "dir_$idx" &
done

wait

Thanks!

theotherjimmy commented 6 years ago

@screamerbg We should probably consider this a bug.

The Python library fasteners provides interprocess locks that serialize access among multiple processes, which is exactly this use case. We use them in the mbed-ls platform database.
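On the CI side, the same idea (an interprocess lock) can be approximated from the shell with flock(1) from util-linux. The following is only a minimal sketch, not how mbed CLI or fasteners implement locking: the with_lock helper and the lock file path in the usage comment are hypothetical, and it assumes flock is installed (it normally is on Ubuntu).

```shell
#!/bin/bash
# Hypothetical helper (not part of mbed CLI): run a command while holding an
# exclusive lock on a lock file, so only one process at a time runs it.
with_lock() {
    local lock_file="$1"
    shift
    (
        # Block until file descriptor 9 is exclusively locked, then run the command.
        flock -x 9 || exit 1
        "$@"
    ) 9>"$lock_file"
}

# Example usage (the lock path is an assumption, pick any shared location):
# with_lock "$HOME/.mbed/deploy.lock" mbed deploy -vvv
```

Wrapping the whole mbed deploy call this way serializes the jobs, so it trades parallelism for safety. It may still be faster than running with the cache turned off, but that would need measuring.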

screamerbg commented 6 years ago

@adamelhakham Thanks for the awesome and thorough bug report.

@theotherjimmy This is a problem of executing multiple mbed CLI in parallel, not interprocess.

I'll look into a solution for this very soon.

theotherjimmy commented 6 years ago

@screamerbg

> This is a problem of executing multiple mbed CLI in parallel, not interprocess.

That sentence is confusing me. Executing multiple CLIs in parallel is what an interprocess lock would protect against. Maybe you misread that as intraprocess?

trianglee commented 6 years ago

@screamerbg Is there an update on this issue? We are suffering terrible performance (and large network utilization) due to lack of cache in our automation tasks.

JanneKiiskila commented 6 years ago

This should get high priority; it impacts our Jenkins jobs quite heavily, too. We can't use the cache feature at all until this is resolved.

screamerbg commented 6 years ago

@adamelhakham Could you try the f/thread_safety branch on my fork - https://github.com/screamerbg/neo/tree/f/thread_safety ?

trianglee commented 6 years ago

@screamerbg Before @adamelhakham tries, could you confirm you managed to reproduce the original problem with his description, and can't reproduce it in the new branch?

screamerbg commented 6 years ago

@trianglee I can't reproduce it either, and that's why I asked @adamelhakham to test it with my fork.

trianglee commented 6 years ago

@screamerbg I see. Thanks. So let's verify, @adamelhakham.

adamelhakham commented 6 years ago

@screamerbg I ran the script I provided for a few hours and the issue did not appear. With mbed-cli 1.5 it reproduces within 10-20 minutes, so it seems that the issue may have been fixed. Next week I will integrate your branch into our CI so that we can test it better. I will keep you posted. Thanks!

screamerbg commented 6 years ago

@adamelhakham Great! Please let me know so we can plan a patch release with this fix.

adamelhakham commented 6 years ago

@screamerbg We now sometimes encounter the following error:

[mbed] ERROR: Cache lock file exists with a different pid ("12822" vs "12727")

Do you know why? Thanks!
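For readers unfamiliar with PID lock files, here is one way a message of this kind can arise in general. This is a hypothetical sketch, not the code in the f/thread_safety branch: the function names and the message text are made up for illustration. A process records its PID in a lock file when it takes the lock and checks the recorded PID before releasing it. If another process removes or replaces the file in between (for example, because it considered the lock stale), the first process finds a different PID.

```shell
#!/bin/bash
# Hypothetical PID lock file helpers; NOT mbed CLI's actual implementation.

# Take the lock by creating the file atomically and recording the given PID.
# Returns non-zero if the lock file already exists.
acquire_pid_lock() {
    local lock_file="$1" pid="$2"
    # With noclobber, '>' fails instead of overwriting an existing file.
    ( set -o noclobber; echo "$pid" > "$lock_file" ) 2>/dev/null
}

# Release the lock only if the file still records the given PID.
release_pid_lock() {
    local lock_file="$1" pid="$2" owner
    owner="$(cat "$lock_file" 2>/dev/null)" || return 1
    if [ "$owner" != "$pid" ]; then
        echo "lock file records a different pid (\"$owner\" vs \"$pid\")" >&2
        return 1
    fi
    rm -f "$lock_file"
}
```

In this sketch, the mismatch appears whenever the lock file is taken over while its original owner is still running, so a takeover policy (timeouts, stale-lock detection) is the first place to look.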

jenia81 commented 6 years ago

@screamerbg can you help @adamelhakham with his question?

ciarmcom commented 6 years ago

ARM Internal Ref: MBOTRIAGE-446

adamelhakham commented 6 years ago

There is progress: @screamerbg has made some fixes. We are doing some further testing and will keep you posted.

teetak01 commented 6 years ago

@screamerbg Any idea when this issue will be fixed? It is still present in 1.8.0.

yogpan01 commented 6 years ago

@theotherjimmy @ARMmbed/mbed-os-maintainers This is an issue for the 5.10 release and the setup for client testing.

theotherjimmy commented 6 years ago

I was able to reproduce this with parallel-rust. Steps: run the reproducer in the issue's top comment to generate the dir_{1..8} directories, then make a reproducer.bash with the following contents:

set -e

cd "$1"
rm -rf mbed-os
if [ ! -f .mbed ]; then
    mbed new .
fi

mbed deploy -vvv
cd ..

Then run:

parallel -v -j8 'bash reproducer.bash {}' ::: $(echo dir_*)

This will force all 8 mbed invocations to run at almost exactly the same time. One of the mbed-cli instances will fail to cache correctly:

[mbed] WARNING: Unable to cache "/home/jimmy/temp/dir_8/mbed-os" to "/home/jimmy/.mbed/mbed-cache/github.com/ARMmbed/mbed-os"

Running with the changes from #752, I can't get that same line. Amusingly, it's also a bit quicker.

ARMmbed / mbed-cli

Issue when several processes use mbed cache in parallel #660