Presence and Absence File Error

pthieringer commented 1 year ago

Hello!

I have recently installed Coinfinder and have been running into issues with getting it to run. Below is the code I am using to run the program.

coinfinder -i genes_presence_absence_coinfinder.txt -p Marine_AOA_iqtree.treefile -o /COINFINDER_OUTPUT/Marine_AOA_associate -a -m

However, I run into the following error related to my gene presence/absence file. It's hard for me to tell what might be formatted incorrectly in the file I am providing. I have a tab-delimited list of genes and the MAGs they are present in, so I am not sure what might be causing the error.

`Reading arguments...
CORRECTION ······· = BONFERRONI
METHOD ··········· = COINCIDENCE
ALT_HYPOTHESIS ··· = GREATER
MAX_MODE ········· = ACCOMPANY
SET_MODE ········· = FULL
PERMIT_FILTER ···· = NO
VERBOSE ·········· = NO
OUTPUT_ALL ······· = NO
FRACTION CUTOFF ·· = N/A
SIGNIFICANCE_LEVEL = 0.05
COMBINED_FILE ···· = genes_presence_absence_coinfinder.txt
GENE_NAME ······· = Genes
GENOME_NAME ········ = Genomes
Formating input into gene_p_a for input into coinfinder...
ERROR MESSAGE FROM Python: Traceback (most recent call last):
  File "/home/pthierin/miniconda3/envs/COIN/bin//coinfind-code/create_roary.py", line 69, in <module>
    x = gen_hash[gen]
KeyError: ''`

Thank you for your time and advice!

fwhelan commented 1 year ago

Hi Patrick,

Thanks for using coinfinder! Could you please paste the first few lines of your genes_presence_absence_coinfinder.txt file here? (e.g. head -n 4 genes_presence_absence_coinfinder.txt). I think this is just a small formatting issue.

pthieringer commented 1 year ago

Sure thing! Here are the first few lines of my file:

GC_00000001 AAA007_O23 GC_00000001 AAA282_K18 GC_00000001 AAA286_D17 GC_00000001 AAA287_E17

As an update, I think I figured out what was causing the issue, with a little help from a colleague. It seems that when Coinfinder takes a tab-delimited list and converts it into Roary format, it somehow adds a blank row at the top of the sorted.tmp file.

Our solution was to hardcode a small check into the create_roary.py file by adding the lines below to the Python code. EDIT: bold was not working, so I marked the changes with comments at the side.

`with open("sorted.tmp",'r') as f:
    line = f.readline()
    while True:
        if line=="\t\n":        #From here
            line=f.readline()
            continue            #To here is the updated code chunk (skips the blank row)
        if not line:
            break #EOF
        #Make/empty genomeloc array
        genomeloc = ["" for x in range(loc_len)]
        try:
            geneID = line.split("\t")[0]
        except:
            print("Cannot create geneID from line: " + line)
            exit()
        while (line.split("\t")[0] == geneID):
            #Get location for corresponding genome
            gen = (line.split("\t")[1]).strip()
            x = gen_hash[gen]
            #Append current entry to array
            if (genomeloc[x] == ""):
                genomeloc[x] = gen+"_"+geneID
            else:
                genomeloc[x] = genomeloc[x]+" "+gen+"_"+geneID
            line = f.readline()
        bulk = ("".join([',"{}"'.format(genomeloc[n]) for n in range(len(genomeloc))]))
        roary.write("\""+geneID+"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\",\"\""+bulk+"\n")`

I don't know if there's a simpler or more effective way of doing this, but I was able to get everything to run after this fix! Let me know if you need any additional information or code.

fwhelan commented 11 months ago

From the output of the top of your file, it looks like there is more than one gene/genome pair per line, or is that just GitHub's formatting? Coinfinder would expect the file to look like: GC_00000001 AAA007_O23 GC_00000001 AAA282_K18 GC_00000001 AAA286_D17 GC_00000001 AAA287_E17

Where GC_00000001 is one of your genes and AAA007_O23 is one of your genomes.

fwhelan commented 11 months ago

But considering I got github to struggle to show what I wanted above, I imagine it's just a github formatting thing!

In which case... I can't see where a purposefully blank line would be written to the top of sorted.tmp in create_roary.py. I doubt there is much harm in just removing the blank line, but it might be worth tracing it back to be sure that there isn't a bug that is, e.g., removing a gene and leaving the blank line in its place. I wonder if there might be a blank line at the bottom of your input file? My code is only smart enough to check that there are 2 columns of information in the first line of the input file; when it sorts on lines 31-35, if there was a blank line, perhaps an empty gene ID would be sorted to the top of sorted.tmp?

fwhelan commented 11 months ago

This little test suggests that's a possibility anyway! Let me know if this ends up being the issue and I'll improve the code to detect and remove blank lines.
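For anyone following along, here is a minimal sketch of that failure mode in plain Python. It is a simplified model and not the code in create_roary.py: the `genome_index` helper, the two sample rows, and the `gen_hash` contents are made up for illustration, and the `"\t\n"` row is assumed to be what the blank input line turns into (it is the row the patch above checks for).

```python
def genome_index(row, gen_hash):
    """Look up the column index for the genome named in a 'gene<TAB>genome' row."""
    genome = row.split("\t")[1].strip()
    return gen_hash[genome]


# Two valid rows plus the empty row that a trailing blank line is assumed to produce.
rows = [
    "GC_00000002\tAAA282_K18\n",
    "GC_00000001\tAAA007_O23\n",
    "\t\n",
]

# A tab character sorts before any letter, so the empty row ends up first.
rows_sorted = sorted(rows)

# Hypothetical genome-to-column mapping; it has no entry for the empty name.
gen_hash = {"AAA007_O23": 0, "AAA282_K18": 1}

try:
    genome_index(rows_sorted[0], gen_hash)
    error = None
except KeyError as exc:
    error = exc  # the empty genome name is not a key in gen_hash

print(repr(rows_sorted[0]))
print(repr(error))
```

In this model the empty genome name is what triggers the `KeyError: ''` seen in the traceback, which fits the idea that the input file, rather than the conversion step, is where the blank row comes from.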

pthieringer commented 11 months ago

Hi Fiona!

Thanks so much for the thorough reply! First, yes I think my copy and paste of the genes presence/absence file seems to have been formatted weirdly through Github. It is as you are expecting it to look with the genes in the first column and the genome/MAG names in the second column.

I just did a quick test using tail to see if there was a blank line at the end of the file....and there it was :)
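In case it helps others who would rather check this from Python than with tail, a small sketch of the same check follows. The function name `count_trailing_blank_lines` is made up here, and it treats whitespace-only lines as blank, which is an assumption about what counts as "blank" for this purpose.

```python
def count_trailing_blank_lines(lines):
    """Return how many blank (empty or whitespace-only) lines end the input."""
    trailing = 0
    for line in lines:
        if line.strip() == "":
            trailing += 1  # extend the current run of blank lines
        else:
            trailing = 0   # a non-blank line resets the run
    return trailing


# Usage with a real file (file name taken from the command at the top of this issue):
#   with open("genes_presence_absence_coinfinder.txt") as fh:
#       print(count_trailing_blank_lines(fh))
example = ["GC_00000001\tAAA007_O23\n", "GC_00000001\tAAA282_K18\n", "\n"]
print(count_trailing_blank_lines(example))
```

Any result above zero means the file ends with at least one blank line.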

So it does look like a blank line will be placed at the top and then cause the code to not run properly, but that was because of user error! Thanks for nailing down this issue, hopefully this will be an easy fix for others in the future if they run into this.

Thanks again for all the help and feedback!

fwhelan commented 11 months ago

Hi Patrick,

So great to hear this was an easy fix! I'm going to leave this issue open until I have a chance to improve the code so that it will spit this up as an easier-to-navigate error in the future.
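As a rough idea of what such a check could look like (a sketch only, not what is or will be in coinfinder): the hypothetical `read_gene_genome_pairs` below skips blank lines and raises an error naming the offending line when a row is not exactly two non-empty tab-separated fields. The function name, the error wording, and the choice to skip blank lines silently rather than warn are all assumptions.

```python
def read_gene_genome_pairs(lines):
    """Parse 'gene<TAB>genome' lines, skipping blank lines and rejecting malformed ones."""
    pairs = []
    for lineno, raw in enumerate(lines, start=1):
        if raw.strip() == "":
            continue  # blank line: ignore instead of passing an empty gene ID downstream
        fields = [field.strip() for field in raw.rstrip("\r\n").split("\t")]
        if len(fields) != 2 or not all(fields):
            raise ValueError(
                f"line {lineno}: expected 'gene<TAB>genome', got {raw!r}"
            )
        pairs.append((fields[0], fields[1]))
    return pairs


sample = [
    "GC_00000001\tAAA007_O23\n",
    "GC_00000001\tAAA282_K18\n",
    "\n",  # trailing blank line like the one found in this issue
]
pairs = read_gene_genome_pairs(sample)
print(pairs)
```

Checking every line, rather than only the first, is what lets the error point at a specific line number instead of surfacing later as a KeyError.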

Happy co-occurrencing!

fwhelan / coinfinder

Presence and Absence File Error #70