Run in parallel - Githubissues

ozak commented 5 years ago

Hi,

Is there a way to run this in parallel in a Jupyter ipyparallel session? I need to perform the same operation on many files, so I was planning to run it in multiple processes via ipyparallel. The problem is that it is not clear how to execute the cell magic %%stata on a client. Here's some code to give the idea:

def DecodeLabels(job):
    """Decode value-labelled variables in Stata files."""
    z, f = job  # each job is a (z, f) tuple; unpacking here works on Python 2 and 3
    fin = pathout5 + z + '/' + f
    if not os.path.exists(pathout6 + z + '/'):
        os.mkdir(pathout6 + z + '/')
    fout = pathout6 + z +'/' + f
    StataCommand = """
    use "fin", clear
    ds, has(vallabel)
    foreach v of varlist `r(varlist)'{
        sdecode `v', replace
    }
    label drop _all
    compress
    save "fout", replace
    """
    get_ipython().run_cell_magic(u'stata', u'', StataCommand.replace("fin", fin).replace("fout", fout))
    return 0

jobs = []
for z in types:
    files = os.listdir(path + z + '/')
    for f in files:
        jobs.append((z, f))

results = view.map_async(DecodeLabels, jobs)

The code inside the function DecodeLabels works fine when run directly, but not in the parallel execution. Any ideas?

Thanks for the great package!

ozak commented 5 years ago

Some more info: the code runs, but it does not seem to generate or process more than one input, so it is not really doing anything in parallel.

ozak commented 5 years ago

I tried another function that appends many files, and things are even worse: it now returns no errors, but does not seem to process even one file.

def MergeNondecoded(z):
    files = os.listdir(pathout5 + z + '/')
    fin = pathout5 + z + '/' + files[0]
    identifier = 'gen filename = "' + files[0].replace('.dta', '').replace('.DTA', '') + '"'
    myappend = "\n".join(["qui append using " + '"' + pathout5 + z + '/' + f + '", force\ncapture drop s*\nreplace filename = "' + f.replace('.dta', '').replace('.DTA', '') + '" if filename==""\ndi "' + f + '"' for f in files[1:]])
    fout = pathout5 + z + ".dta"
    StataCommand = """
    set matsize 11000
    set maxvar 32000
    use "fin", clear
    identifier
    myappend
    compress
    save "fout", replace
    """
    StataCommand = StataCommand.replace('identifier', identifier)
    StataCommand = StataCommand.replace('myappend', myappend)
    StataCommand = StataCommand.replace('fout', fout).replace('fin', fin)
    get_ipython().run_cell_magic(u'stata', u'', StataCommand)
    return 0

results = view.map_async(MergeNondecoded, list(dfzip.DataInfo.unique()))

This returns a vector of zeros, but no files are created.

ozak commented 5 years ago

But running

for z in dfzip.DataInfo.unique():
    MergeNondecoded(z)

works fine.

TiesdeKok commented 5 years ago

Hi Ozak!

Sorry for not getting back to you earlier, I just switched jobs so life is a bit hectic.

Running it in parallel probably doesn't work because I haven't programmed some of the temp files to be completely isolated, at least not for the "batch mode" functionality. I agree that this would be nice but to be honest it is a cost (i.e. my time) vs. benefit trade-off.

My recommendation would be to just interact with Stata directly without using ipystata. This is actually very simple as it only requires a couple of things to change:

- Save the "input" data you want to use as a .dta file and load it in your Stata code (instead of passing it in via ipystata). This is easy enough with Pandas.
- If you want some kind of logging, add standard Stata logging commands to your Stata code.
- Once you have the Stata code you want to run in a string, save it to a .do file with Python and then use Python to execute that do-file through the Stata command line interface.

You can see how I tell Python to run a .do file using the command line here: https://github.com/TiesdeKok/ipystata/blob/21049a4b0639aaf8cbda4a889cf4cd562c4b7d7d/ipystata/ipystata_magic_batch.py#L200-L217

Obviously you would have to tell it where the Stata executable is.
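
For illustration, here is a minimal sketch of that approach applied to the DecodeLabels example from earlier in this thread. It is not ipystata code. The names STATA_EXE, DO_TEMPLATE, build_do_file, stata_command, decode_labels and run_all are made up for this example, and the executable name "stata-mp" is an assumption that you will need to change to match your installation. Each job gets its own temporary working directory, so the do-files and logs of parallel jobs do not collide.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumption: name or full path of your Stata executable; adjust as needed.
STATA_EXE = "stata-mp"

# Stata code with placeholders; literal braces are doubled for str.format().
DO_TEMPLATE = """\
log using "{log}", text replace
use "{fin}", clear
ds, has(vallabel)
foreach v of varlist `r(varlist)' {{
    sdecode `v', replace
}}
label drop _all
compress
save "{fout}", replace
log close
"""


def build_do_file(fin, fout, log):
    """Fill the Stata template with concrete file paths."""
    return DO_TEMPLATE.format(fin=fin, fout=fout, log=log)


def stata_command(stata_exe, do_path):
    """Return the argument list that runs a do-file in batch mode."""
    # Stata for Windows takes /e, Stata for Unix and macOS takes -b.
    flag = "/e" if sys.platform.startswith("win") else "-b"
    return [stata_exe, flag, "do", str(do_path)]


def decode_labels(job):
    """Run one (fin, fout) job in its own temporary working directory."""
    fin, fout = job
    workdir = Path(tempfile.mkdtemp(prefix="stata_job_"))
    do_path = workdir / "job.do"
    do_path.write_text(build_do_file(fin, fout, workdir / "stata_output.log"))
    # cwd=workdir keeps any files Stata writes on its own separate per job.
    result = subprocess.run(stata_command(STATA_EXE, do_path), cwd=workdir)
    return result.returncode


def run_all(jobs, max_workers=4):
    """Run the jobs concurrently; each worker thread waits on one Stata process."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(decode_labels, jobs))
```

Calling run_all([(fin1, fout1), (fin2, fout2)]) with full input and output paths would start up to four Stata processes at a time. Threads are enough here because the real work happens in the external Stata processes. The return code may not reflect errors raised inside the do-file, so it is worth checking the logs. The same decode_labels function could probably be passed to view.map_async in ipyparallel as well, provided the imports and constants are also defined on the engines.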

Does this help?

ozak commented 5 years ago

I think I get the idea. Thanks!

TiesdeKok / ipystata

Run in parallel #38