Issue with paprica-combine_results.py

nkhadland commented 8 months ago

Hi Jeff,

Wasn't sure if your website forum was still operational, so I'm reposting here. Great program, and I'm excited to see my results. I successfully ran my own samples, but when I try to run the combine_results code, I end up with an error. Specifically, it flags a nan value and throws a KeyError, which should be handled by the exception line in the code (see the copied source code below, lines 256-259). However, it then throws another error on line 248: "ValueError: cannot reindex on an axis with duplicate labels." Any ideas on what could be going wrong? I didn't want to try to debug the source code without seeing if you had any ideas first. Thanks!

        try:
            unique_edge_abund.loc[seq,  temp_unique.loc[seq, 'global_edge_num']] += temp_unique.loc[seq, 'abundance_corrected']
        except KeyError:
            unique_edge_abund.loc[seq,  temp_unique.loc[seq, 'global_edge_num']] = temp_unique.loc[seq, 'abundance_corrected']
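
For anyone hitting the same pair of errors, here is a minimal sketch of how to check a data frame's index for the two conditions they point at: a missing (nan) label and repeated labels. pandas typically raises the "duplicate labels" ValueError when an assignment has to align against an index that contains repeats. The variable name temp_unique mirrors the snippet above, but the data are made up and describe_index_problems is a hypothetical helper; nothing in this thread confirms that the real index contains duplicates.

```python
import numpy as np
import pandas as pd


def describe_index_problems(df):
    """Return the number of missing index labels and the list of repeated labels."""
    idx = df.index
    n_missing = int(idx.isna().sum())
    duplicated = idx[idx.duplicated(keep=False)].unique().tolist()
    return n_missing, duplicated


# Toy stand-in for temp_unique (made-up data): one repeated sequence, one nan label.
temp_unique = pd.DataFrame(
    {
        'global_edge_num': [10, 10, 12, 13],
        'abundance_corrected': [1.0, 2.0, 3.0, 4.0],
    },
    index=['ACGT', 'ACGT', 'TTGA', np.nan],
)

n_missing, duplicated = describe_index_problems(temp_unique)
print('missing labels:', n_missing)
print('repeated labels:', duplicated)

# With a repeated label, .loc returns a Series instead of a single value,
# so code that expects one edge number per sequence no longer gets a scalar.
lookup = temp_unique.loc['ACGT', 'global_edge_num']
print(type(lookup).__name__, len(lookup))
```

If either check reports something on the real data, that narrows down which input file introduced the bad label.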

bowmanjeffs commented 8 months ago

Hmmm... thanks for flagging this as an issue! I haven't seen this before and definitely want to track it down. Unfortunately I'm out of the office until Friday. If you have time for a little troubleshooting, I recommend running from inside python in order to debug, i.e., run exec(open('paprica-combine_results.py').read()) after setting the command line options in paprica-combine_results.py itself. I'll do some digging myself as soon as I am able...

bowmanjeffs commented 8 months ago

@nkhadland are you running the Docker image? If not, what version of Python and Pandas do you have?
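
A quick way to check both from inside the active environment is sketched below. report_versions is a hypothetical helper name; it only reads the version strings that Python and pandas expose.

```python
import platform

import pandas as pd


def report_versions():
    """Collect the interpreter and pandas version strings for a bug report."""
    return {
        'python': platform.python_version(),
        'pandas': pd.__version__,
    }


print(report_versions())
```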

nkhadland commented 8 months ago

Hey Jeff,

No, but I did build a conda env for the install, so looks like I am running python3 3.12.0 and pandas 2.1.4.

bowmanjeffs commented 8 months ago

Okay... back in the office and can start looking at this in detail. I want to know where the nan value that triggered this issue is coming from - that shouldn't be an acceptable value for the variable "seq" in line 243. If you're able to run in debug mode can you identify the file (variable "f") that the script failed on and share here?
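
One possible way to do that, sketched under assumptions: run the script with exec as suggested earlier, but inside a dictionary namespace, so that whatever the script had assigned when it failed can be inspected afterward. run_and_capture is a hypothetical helper, and the sketch assumes "f" and "seq" are module-level variables in the script; if they live inside a function they will not appear in the namespace.

```python
import os


def run_and_capture(script_path):
    """Execute a script and return (namespace, error); error is None on success."""
    namespace = {'__name__': '__main__'}
    error = None
    with open(script_path) as handle:
        # Compiling with the real path keeps file names in any traceback.
        code = compile(handle.read(), script_path, 'exec')
    try:
        exec(code, namespace)
    except Exception as exc:
        error = exc
    return namespace, error


# Only runs when the script is in the current working directory.
script = 'paprica-combine_results.py'
if os.path.exists(script):
    namespace, error = run_and_capture(script)
    print('error:', repr(error))
    print('file being processed (f):', namespace.get('f'))
    print('current sequence (seq):', repr(namespace.get('seq')))
```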

nkhadland commented 8 months ago

Hi Jeff,

This may be a mistake on my part but when I try to run the script locally in the directory with my files to debug, I get an error farther up on this line:

        try:
            n = list(range(int(math.ceil(df_in.loc[index, 'nedge_corrected']))))
        except ValueError:
            n = list(range(int(df_in.loc[index, 'nedge'])))

The error there is KeyError: 'nedge_corrected', meaning it isn't finding that column in the data frame. That doesn't make sense, because when I inspect the data frames they should have those columns.
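
A small check like the following can confirm which expected columns a given data frame is actually missing at the point of failure. It is only a sketch: missing_columns is a hypothetical helper, the toy df_in holds made-up data, and the column names are taken from the line quoted above.

```python
import pandas as pd


def missing_columns(df, required):
    """Return the required column names that are absent from df (exact match)."""
    return [col for col in required if col not in df.columns]


# Toy stand-in for df_in (made-up data) that lacks the corrected column.
df_in = pd.DataFrame({'nedge': [3, 5], 'abundance': [12.0, 7.0]})

missing = missing_columns(df_in, ['nedge', 'nedge_corrected'])
print('missing:', missing)

# Printing the raw labels makes stray whitespace or case differences visible.
print([repr(col) for col in df_in.columns])
```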

The only thing I changed was I copied the script to the working directory and changed the domain to bacteria.

Nathan Hadland

bowmanjeffs commented 8 months ago

Are you willing to share your input fasta files with me? I'll delete them after testing.

nkhadland commented 8 months ago

Here are 2 subset files I've been testing on that produce the error (I've tried other samples and have the same issue, so it isn't sample dependent): Test_Fasta.zip

bowmanjeffs commented 8 months ago

Interesting, I can't reproduce the issue on those files. That leads me to believe it's an issue with different pandas versions (you're running a more recent version). I'll work on that and hopefully it's a quick fix. In the meantime, roll back to pandas 1.1.5 in your conda environment and see if that solves the issue.

nkhadland commented 8 months ago

Hi Jeff,

I tried that, still getting the error. Are there other dependencies that could be causing the error? It doesn't seem like it. What python version are you using? I had to rollback to 3.9 to install pandas 1.1.5.

Maybe I should just try doing a fresh install.

Nathan

bowmanjeffs commented 8 months ago

3.9 should definitely work, that's what the Docker image uses. It is mysterious! A fresh install isn't a bad idea. If that doesn't work I suggest using the docker image just to finish your analysis, then working backward to do a proper install. I'll keep trying to replicate it (no luck so far).

bowmanjeffs commented 8 months ago

I haven't been able to reproduce your error even with pandas >2.0. I did clean up some code in that portion of the script though so give it a try and let me know if the problem is resolved...

nkhadland commented 8 months ago

Hi Jeff,

Quick update. I used the new script and received the same error. Then I installed the Docker image to try that and again received the same error. Super bizarre -- I have no explanation. However, I just installed the Docker image on a different machine and it worked, so at least I can continue with my analysis.

Just as an FYI the machine that was having issues was a 2013 Mac Pro running MacOS Monterey 12.6.7.

As an aside -- on your tutorial, it might be worth adding that infernal and epa-ng can be installed via conda :)

Thanks so much for the help.

Nathan

bowmanjeffs commented 8 months ago

Thanks Nathan, very strange, and I think we can chalk it up to the advanced age of the original machine (can't imagine how, though). Troubleshooting pointed me to a number of FutureWarnings that I need to deal with to make sure the code is ready for upcoming versions of pandas and other dependencies, so this was a great exercise for me. I'll go ahead and close this issue, but don't hesitate to reopen if needed.

bowmanjeffs / paprica

Issue with paprica-combine_results.py #99