MS20190155 / Measuring-Corporate-Culture-Using-Machine-Learning

Code Repository for MS20190155

Issue with the parsing script #2

Closed · meyerhj closed 3 years ago

meyerhj commented 4 years ago

Dear Maifeng,

Thank you for the clear and easily understandable implementation of the code used in your paper. I have one question/suggestion:

When I run parse.py, the script fails on every line because corpus_preprocessor is not defined inside process_line. Somehow the corpus_preprocessor object is not made available to the process_line function. While I can rewrite the script to access the NLP client outside of the process_line function, this seems to be an inelegant workaround. Am I overlooking anything? I have installed all the requisite packages, and I have also checked the fixes proposed in the closed issue, but they do not seem to help.

Some additional info:

When I add a quick print(corpus_preprocessor.process_document(string, id)) call directly after the initial setup code (see below), it works fine, even for individual strings from my files. So the issue does not seem to come from CoreNLP.

    with CoreNLPClient(
        be_quiet=False,
        properties={
            "ner.applyFineGrained": "false",
            "annotators": "tokenize, ssplit, pos, lemma, ner, depparse",
        },
        memory=global_options.RAM_CORENLP,
        threads=global_options.N_CORES,
        timeout=120000000,
        max_char_length=1000000,
    ) as client:
        corpus_preprocessor = preprocess.preprocessor(client)

        print(corpus_preprocessor.process_document(string, id))

Edit:

The issue seems to stem from the use of starmap in my case. If I call the process_line function directly, I do not have the same issue.
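For illustration, here is a minimal sketch (with hypothetical names, not the repository's code) of what I think is happening: under the "spawn" start method, each worker process re-imports the module instead of inheriting the parent's memory, so an object created only under the __main__ guard is undefined inside the workers.

    import multiprocessing as mp


    class Preprocessor:
        """Stand-in for preprocess.preprocessor; just lower-cases the text."""

        def process_document(self, doc, doc_id):
            return doc.lower(), doc_id


    def process_line(line, line_id):
        # Under "spawn" this raises NameError: corpus_preprocessor is not
        # defined, because the worker never executed the __main__ block below.
        return corpus_preprocessor.process_document(line, line_id)


    if __name__ == "__main__":
        mp.set_start_method("spawn")  # default on Windows, and on macOS since 3.8
        corpus_preprocessor = Preprocessor()  # exists only in the parent process
        with mp.Pool(2) as pool:
            print(pool.starmap(process_line, zip(["A line."], ["doc1"])))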

Best wishes, Hauke

maifeng commented 3 years ago

Windows has a different mechanism for multiprocessing; see this thread.

I have not figured out how to integrate multiprocessing with CoreNLP on Windows yet. So I made the following change in the code:

        if global_options.WINDOWS is False:
            # Non-Windows: parse chunks of lines in parallel worker processes.
            with Pool(global_options.N_CORES) as pool:
                for output_line, output_line_id in pool.starmap(
                    function_name, zip(next_n_lines, next_n_line_ids)
                ):
                    output_lines.append(output_line)
                    output_line_ids.append(output_line_id)
        else:
            # Windows: process the lines sequentially in the main process.
            for output_line, output_line_id in map(
                function_name, next_n_lines, next_n_line_ids
            ):
                output_lines.append(output_line)
                output_line_ids.append(output_line_id)

I think this way the CoreNLP Java server still utilizes multiple CPUs, but the Python code that processes the results runs single-threaded. The compromise works for now, and I am interested in learning about other workarounds.
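One possible alternative (a sketch only, with hypothetical names; untested against a running CoreNLP server) is to build the preprocessor inside each worker via a Pool initializer, so nothing from the parent process has to be inherited or pickled:

    from multiprocessing import Pool

    _worker_preprocessor = None  # filled in separately inside each worker


    def init_worker():
        """Build a preprocessor once per worker process."""
        global _worker_preprocessor

        class _StubPreprocessor:  # stand-in for preprocess.preprocessor(client)
            def process_document(self, doc, doc_id):
                return doc.lower(), doc_id

        _worker_preprocessor = _StubPreprocessor()


    def process_line(line, line_id):
        return _worker_preprocessor.process_document(line, line_id)


    if __name__ == "__main__":
        lines, ids = ["First line.", "Second line."], ["d1", "d2"]
        with Pool(2, initializer=init_worker) as pool:
            print(pool.starmap(process_line, zip(lines, ids)))

Whether a per-worker preprocessor pointing at the same CoreNLP server would behave well under load is an open question here.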

maifeng commented 3 years ago

Just confirmed that it is not a Windows-specific issue. starmap works on Python 3.6 and 3.7 but not on 3.8, so replacing starmap with map should work.
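One possible explanation is that Python 3.8 changed the default multiprocessing start method on macOS from "fork" to "spawn"; under "spawn", globals such as corpus_preprocessor are not inherited by the worker processes. A quick way to check which start method an interpreter uses (a sketch, not part of the repository):

    import multiprocessing as mp

    if __name__ == "__main__":
        # "fork" on Linux and on macOS before Python 3.8;
        # "spawn" on Windows and on macOS from Python 3.8 onward.
        print(mp.get_start_method())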

maifeng commented 3 years ago

Added parse_parallel.py to support starmap on Python 3.8.