aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0
2.82k stars 547 forks source link

Bug in Reading FASTAs for `precompute_alignments_mmseqs.py` #411

Open brucejwittmann opened 8 months ago

brucejwittmann commented 8 months ago

The way the FASTA file is loaded in precompute_alignments_mmseqs.py is very fragile. https://github.com/aqlaboratory/openfold/blob/ef0c9face788001b1624b3b2dbfa951072841835/scripts/precompute_alignments_mmseqs.py#L36-L40

As written, it assumes that the FASTA file is written with headers and sequences alternating every other line. Often, however, FASTA files are written with a header line (starting with ">") followed by multiple lines of sequence information, typically no more than 80 characters per line. Indeed, the BioPython implementation of writing sequences in FASTA format uses this multi-lined approach for long sequences.

For any FASTA file written with multi-line sequences, the current code will incorrectly treat some sequence lines as headers and some header lines as sequences.

Fortunately, this bug will only have an effect when setting fasta_chunk_size, as the default chunk size of len(seqs) results in an integer much larger the actual number of sequences in the FASTA file. Thus, the subsequent lines

https://github.com/aqlaboratory/openfold/blob/ef0c9face788001b1624b3b2dbfa951072841835/scripts/precompute_alignments_mmseqs.py#L51-L55

will just recreate the original FASTA file. If a chunksize was set, though, then some sequences at the beginning and end of the chunk may end up truncated; some sequence fragments at the beginning will also be missing header information and some headers at the end may also be missing sequence information.

I'd recommend just using BioPython to load in the FASTA files as it can handle both single-line and multi-line FASTA file formats. This would also simplify the chunking procedure. For instance, the sequences can be loaded with:

seqs = list(SeqIO.parse(args.input_fasta, "fasta"))

Chunking and saving can then be performed as below:

s = 0
while(s < len(seqs)):
    e = s + chunk_size
    chunk_fasta = seqs[s: e]
    s = e

    chunk_fasta_path = os.path.join(args.output_dir, "tmp.fasta")
    with open(chunk_fasta_path, "w") as f:
        SeqIO.write(chunk_fasta, f, "fasta"))
brucejwittmann commented 7 months ago

Bumping this as it actually causes a problem even when the chunksize is not set. Imagine a FASTA file with an odd number of lines in it:

>seq1
AAAAAAAAAAA
AAAAAA
>seq2
BBBBBBBBBBBB
BBBBBBBBBBBB
BBBB

If I run the code stepwise as is on this file, the last line will be cut off. Running through everything stepwise:

names = lines[::2]  >> [>seq1, AAAAAA, BBBBBBBBBBBB, BBBB]
seqs = lines[1::2] >> [AAAAAAAAAAA, >seq2, BBBBBBBBBBBB]

chunk_size = len(seqs) >> 3

s = 0
while (s < 3): # Explicitly writing "3" for "len(seqs)"
    e = s + 3 >> 3
    chunk_fasta = [el for tup in zip(names[0:3], seqs[0:3]) for el in tup] >> [>seq1, AAAAAAAAAAA, AAAAAA, >seq2, BBBBBBBBBBBB, BBBBBBBBBBBB] 
    s = e ## Breaks loop as s == 3

With an even number of lines, we don't lose the last line and there will be no error as, fortunately, the chunking process just puts everthing back together. With an odd number of lines, however, the query sequence in the alignments for the last sequence in the FASTA file will be truncated--there will thus be a mismatch between the sequence input to the structure prediction script via the FASTA file and query sequence in the alignments. I think that this may be related to issue #186 , as when I run prediction with these mismatches, I get a singularity-like structure that fails minimization with the error "Minimization failed after 100 attempts" and repeated printouts of "article coordinate is nan". The coordinates are likely "nan" because they do not exist in the alignment.