KosinskiLab / AlphaPulldown

https://doi.org/10.1093/bioinformatics/btac749
GNU General Public License v3.0
193 stars 46 forks source link

rename_colab_search_a3m.py does something unintended #378

Open IwanParf opened 1 month ago

IwanParf commented 1 month ago

Hello dear Alphapulldown developers. I am trying to set up the colabfold part for alphapulldown. I got my *.a3m files in a directory and I run rename_colab_search_a3m.py to get more meaningful filenames. however when doing so, I get this output:


(AlphaPulldown) [....]$ rename_colab_search_a3m.py

Renaming 19.a3m to 101.a3m
Renaming 15.a3m to 101.a3m
Renaming 2.a3m to 101.a3m
Renaming 13.a3m to 101.a3m
Renaming 12.a3m to 101.a3m
Renaming 5.a3m to 101.a3m
Renaming 20.a3m to 101.a3m
Renaming 7.a3m to 101.a3m
Renaming 6.a3m to 101.a3m
Renaming 17.a3m to 101.a3m
Renaming 4.a3m to 101.a3m
Renaming 11.a3m to 101.a3m
Renaming 9.a3m to 101.a3m
Renaming 10.a3m to 101.a3m
Renaming 16.a3m to 101.a3m
Renaming 3.a3m to 101.a3m
Renaming 14.a3m to 101.a3m
Renaming 1.a3m to 101.a3m
Renaming 0.a3m to 101.a3m
Renaming 8.a3m to 101.a3m
Renaming 18.a3m to 101.a3m

Effectively, all files are renamed to the same name so that in the end only one file is left with the name 101.a3m, supposedly previous 18.a3m. When I look at the code of that python script, it loops over the .a3m files and performs the function get_first_seq_name() on them (line 40-41):

for file in glob.glob("*.a3m"):
    name = get_first_seq_name(file)

Which does not make sense. It is supposed to loop over the present fasta file in that directory, extract the sequence names and then loop over the a3m files to rename them, right? If I check the head of the first .a3m with:

head 0.a3m
>101
MESAIAEGGASRFSASSGGGGSRGAPQHYPKTAGN [......]

So ,yes, the script take the first fasta header from the first a3m and renames the a3m file by that one header. I am unfortunately a noob with python and chatGPT is not helping either, so I can't fix this and submit a pull request. Can someone please double check whether rename_colab_search_a3m.py does what it is supposed to do?

Best regards,

Iwan

jkosinski commented 1 month ago

Hi @IwanParf - it used to work with colab_search, which saved the real sequence name as the first sequence. That name could then be used to name the files. Perhaps this changed and now only the sequence is always named 101. What were the exact procedures and commands you used to generate the alignments?

IwanParf commented 1 month ago

Hey @jkosinski, thanks for the speedy answer! I see, it might be an issue with my setup then? Somebody else within my organization set up colabfold (version 1.5.5) and I am running this part within a SLURM script:

# load miniconda
module load miniconda3

# activate ColabFold environment
conda activate /projects/abv/ALPHAFOLD/COLABFOLD/conda_env/colabfold

# run MSAs on input fasta file
colabfold_search --db-load-mode 0 all_sequences.fasta /projects/abv/ALPHAFOLD/COLABFOLD/MsaServer/databases msas

I don't see a conda_env folder in the colabfold github, so this ColabFold environment might be a custom thing that person set up for ease of use. Possibly. I will check with them! Thank you!

IwanParf commented 1 month ago

Okay, I think I solved it. In case that this actually did change in colabfold and other users end up here, here's the code to rename a3m files according to a fasta file in the same directory. Simply save the following code as a xxx.py file and run it with python3 xxx.py:

#!/usr/bin/env python3

import os
import re

def get_sequence_headers(fasta_file):
    sequence_headers = []
    with open(fasta_file) as f:
        for line in f:
            if line.startswith(">"):
                sequence_headers.append(line[1:].strip())
    return sequence_headers

fasta_files = [file for file in os.listdir() if file.endswith(".fasta")]
if len(fasta_files) != 1:
    print("There should be exactly one fasta file in the directory.")
else:
    fasta_file = fasta_files[0]
    a3m_files = sorted([file for file in os.listdir() if file.endswith(".a3m")], key=lambda x: int(re.search(r'\d+', x).group()))

    sequence_headers = get_sequence_headers(fasta_file)

    if len(sequence_headers) != len(a3m_files):
        print("Number of sequence headers and number of a3m files do not match.")
    else:
        for i in range(len(a3m_files)):
            new_a3m_file = sequence_headers[i] + ".a3m"
            os.rename(a3m_files[i], new_a3m_file)
            print(f"Renamed {a3m_files[i]} to {new_a3m_file}")

It reorders the a3m files by the numeric value in their names, i.e. 0, 1, 2, 3 etc and not by interpreting the names as character, ie 0, 1, 10, 11, 12, etc and then renames them after the sequence headers after > in the fasta file. Kudos to chatGPT. If anybody cares, please correct me if this is wrong, sure was in the past...

@jkosinski, thank you for your help!

jkosinski commented 1 month ago

Great you fixed it! I leave the issue open as we might need to update our script or contact colabfold developers to change their naming convention.