MathOnco / NeoPredPipe

Neoantigens prediction pipeline for multi- or single-region vcf files using ANNOVAR and netMHCpan.
GNU Lesser General Public License v3.0
100 stars 28 forks source link

IndexError: list index out of range in NeoRecoPo step #30

Closed wt12318 closed 2 years ago

wt12318 commented 3 years ago

Hi, Sorry to bother again. I used following code to perform neoantigen recognition potential predictions:

NeoRecoPo.py -d --neopred_in=test.txt --neoreco_out=./ --fastas=./test

But some errors had happened:

INFO: Begin.
Traceback (most recent call last):
  File "/public/slst/home/wutao2/software/NeoPredPipe/NeoRecoPo.py", line 135, in <module>
    main()
  File "/public/slst/home/wutao2/software/NeoPredPipe/NeoRecoPo.py", line 98, in main
    preds.ConstructWTFastas()
  File "/slst/home/wutao2/software/NeoPredPipe/StandardPredsClass.py", line 195, in ConstructWTFastas
    self.__addToFastaFile()
  File "/slst/home/wutao2/software/NeoPredPipe/StandardPredsClass.py", line 169, in __addToFastaFile
    seqID, seq = self.__extractSeq(sam, fasta_head, epitopeLength)  # WT seqID and seq
  File "/slst/home/wutao2/software/NeoPredPipe/StandardPredsClass.py", line 269, in __extractSeq
    WTepiSeq = ExtractSeq(WT[1], pos, epitopeLength)
IndexError: list index out of range

And the tmp files are empty:

wc -l  NeoRecoTMP/*

0 NeoRecoTMP/TCGA-02-0047-01A.wildtype.tmp.10.fasta
0 NeoRecoTMP/TCGA-02-0047-01A.wildtype.tmp.9.fasta
0 total

The following are my input files:

test.txt The ./test dir has the two fasta files: TCGA-02-0047-01A.fasta.txt TCGA-02-0047-01A.reformat.fasta.txt

Thank you.

elakatos commented 3 years ago

Just as a warning, we never tested netMHCpan4.1 with NeoRecoPo, and since your netMHCpan version is used for predicting wild-type binding in NeoRecoPo, the format difference might cause problems in the processing.

That being said, it looks like the error you're getting is before this step. I think the issue is probably coming from your test.txt not being in the same format as NeoRecoPo would expect the output of the previous step (NeoPredPIpe) to be in. For example in your first line, I'm not sure what the last few values stand for following the Identity column - is the 0 in the end meaning the neoantigen is not novel, or is that the 1 before?: line462_NM_0153 0.003614 14569.59 8.548 1 0 NeoRecoPo would filter this out as not novel, but I'm not sure if that's what you'd intend. Your 7th line on the other hand looks more like how I'd expect the table: line462_NM_0153 0.120614 308.75 1.019 <= WB 1 If this 0 in the first line was introduced by some accident, I would suggest to first filter for only WB and SB neoantigens (either manually, or by not using the -a option in NeoPredPipe) before NeoRecoPo, that should ensure all rows have the same number of elements making it easier to bring the table to netMHCpan4.0 format.

Also, based on the beginning of your lines: TCGA-02-0047-01A 0 line462 it seems to me you've run NeoPredPipe with the -c option used in multi-region vcf files, even though TCGA files are usually single region. I don't think this will produce an error, but it would be more appropriate to not use the -c option.

wt12318 commented 3 years ago

Thank you for your reply. I tried to filter for only WB and SB neoantigens and remove the second column: image test.txt

and tried:

NeoRecoPo.py -d --neopred_in=test.txt --neoreco_out=./ --fastas=./test

NeoRecoPo.py -d --neopred_in=test.txt --neoreco_out=./ --fastas=./test
INFO: Begin.
Traceback (most recent call last):
  File "/public/slst/home/wutao2/software/NeoPredPipe/NeoRecoPo.py", line 135, in <module>
    main()
  File "/public/slst/home/wutao2/software/NeoPredPipe/NeoRecoPo.py", line 98, in main
    preds.ConstructWTFastas()
  File "/slst/home/wutao2/software/NeoPredPipe/StandardPredsClass.py", line 195, in ConstructWTFastas
    self.__addToFastaFile()
  File "/slst/home/wutao2/software/NeoPredPipe/StandardPredsClass.py", line 169, in __addToFastaFile
    seqID, seq = self.__extractSeq(sam, fasta_head, epitopeLength)  # WT seqID and seq
  File "/slst/home/wutao2/software/NeoPredPipe/StandardPredsClass.py", line 269, in __extractSeq
    WTepiSeq = ExtractSeq(WT[1], pos, epitopeLength)
IndexError: list index out of range

also the same problem.

And I look at the code, it seems that the seq_recode in this sample Neoantigen predicted file (like line462;NM_0153) are not in the fasta files:

>>> for seq_record in SeqIO.parse(preds.fastas[sample], 'fasta'):
...             seqIdentifier = ';'.join(seq_record.id.split(';',3)[0:2])[0:16]
...             if identifier in seqIdentifier:
...                 print("ok")
...             else:
...                 print("no ok")
... 
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok
no ok

So the WT list is empty, and the operation WT[1] produced IndexError in the __extractSeq function:

def __extractSeq(self, sample, identifier, epitopeLength):
        '''
        Extracts the sequence from the *.tmp.epi.fasta file and reverts the sequence back.

        :return: the wildtype sequence and header
        '''
        WT = []
        Mut = []
        count=0
        for seq_record in SeqIO.parse(self.fastas[sample], 'fasta'):
            seqIdentifier = ';'.join(seq_record.id.split(';',3)[0:2])[0:16]
            if identifier in seqIdentifier:
                count += 1

                if 'WILDTYPE' in seq_record.id.split(';')[2]:
                    WT.append(seq_record.id)
                    WT.append(seq_record)
                else:
                    try:
                        pos = int(seq_record.id.replace(";;", ";").split(";")[5]) - 1
                    except ValueError:
                        pos = int(seq_record.id.replace(";;", ";").split(";")[6]) - 1
                    Mut.append(seq_record.id)
                    Mut.append(seq_record)

                if count==2:
                    break

        WTepiSeq = ExtractSeq(WT[1], pos, epitopeLength)

        return(WT[0], WTepiSeq)

But I don't know how to solve this problem.

Thank you.

elakatos commented 3 years ago

Hi! I think your input (test.txt) looks good now, but indeed there seems to be something fishy going on regarding the fasta file and the neoantigen table. I haven't seen this before, based on the fasta file you've shared, you shouldn't have a neoantigen with ID (V19) line462_NM_0153 in the neoantigen table - it should be line462_NM_0181. I don't see how this could have happened, as in the NeoPredPipe step, the pipeline uses the fasta file to supply a name to netMHCpan (e.g. line462;NM_0153 )and that's the name put into column V19. So there shouldn't be a discrepancy between what is present in the fasta and what is found in the neoantigen table, regardless of netMHCpan version. Is it possible that you had several runs of NeoPredPipe, and the fasta files are from a slightly different run than the neoantigen table? (It can be up to random chance which out of two transcript identifiers (NM_18156 and NM_015378) is listed first and used in the ID, so two separate runs can differ in that and cause problems)

I would suggest making a new, clean run of NeoPredPipe (after deleting temporary files, like fastas, avannotated, etc. or in a separate folder). Then before running NeoRecoPo, first have a quick look by eye if the fasta headers and the neoantigen table IDs are the same - I don't see a reason why they wouldn't be.