KosinskiLab / AlphaPulldown

https://doi.org/10.1093/bioinformatics/btac749
GNU General Public License v3.0

create_individual_features.py #370

Open poojaparameswaran99 opened 1 week ago

poojaparameswaran99 commented 1 week ago

I am trying to pass a FASTA file to the --fasta_paths argument, in the following format:

>Q8C0M9
MACARGTVAPPVRASIDVSLVVVVHGGGASNISANRKELVREGIARAATEGYKILKAGGSAVDAVEGAVTVLENDPEFNAGYGSVLNVNGDIEMDASIMDGKDLSAGAVSAVRCIANPVKLARLVMEKTPHCFLTGHGAEKFAEDMGIPQVPVEKLITERTKKHLEKEKLEKGAQNADCPKNSGTVGAVALDCRGNLAYATSTGGIVNKMVGRVGDSPCIGAGGYADNNLGAVSTTGHGESILKVNLARLALFHVEQGKTVEEAAQLALDYMKSKLKGLGGLILVNKTGDWVAKWTSASMPWAAVKNGKLQAGIDLCETRTRDLPC
>Q6NXK8
MELKTEEEEVGGVQPVSIQAFASSSTLHGLAHIFSYERLSLKRALWALCFLGSLAVLLCVCTERVQYYFCYHHVTKLDEVAASQLTFPAVTLCNLNEFRFSQVSKNDLYHAGELLALLNNRYEIPDTQMADEKQLEILQDKANFRSFKPKPFNMREFYDRAGHDIRDMLLSCHFRGEACSAEDFKVVFTRYGKCYTFNSGQDGRPRLKTMKGGTGNGLEIMLDIQQDEYLPVWGETDETSFEAGIKVQIHSQDEPPFIDQLGFGVAPGFQTFVSCQEQRLIYLPSPWGTCNAVTMDSDFFDSYSITACRIDCETRYLVENCNCRMVHMPGDAPYCTPEQYKECADPALDFLVEKDQEYCVCEMPCNLTRYGKELSMVKIPSKASAKYLAKKFNKSEQYIGENILVLDIFFEVLNYETIEQKKAYEIAGLLGDIGGQMGLFIGASILTVLELFDYAYEVIKHRLCRRGKCQKEAKRNSADKGVALSLDDVKRHNPCESLRGHPAGMTYAANILPHHPARGTFEDFTC

I have about 9,000 entries like this, but when I run create_individual_features.py to create the monomer outputs, only about 20 of them are parsed and written out. Why is that? Is there a maximum number of amino acids allowed per accession? Is the format incorrect?

Thank you in advance!
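
One way to rule out a formatting problem before running the pipeline is to count the records in the file independently. The sketch below is a generic FASTA sanity check, not part of AlphaPulldown; `summarize_fasta` is a hypothetical helper, and the checks it performs (whitespace before `>`, duplicate headers, empty records) are only guesses at what might trip up a parser.

```python
from collections import Counter


def summarize_fasta(text):
    """Return (ids, lengths, problems) for FASTA-formatted text.

    ids      : header text of each record, in file order
    lengths  : number of sequence characters per record
    problems : human-readable descriptions of suspicious lines
    """
    ids, lengths, problems = [], [], []
    for n, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped:
            continue  # blank lines carry no information
        if stripped.startswith(">"):
            if raw[0] != ">":
                problems.append(f"line {n}: whitespace before '>'")
            ids.append(stripped[1:].strip())
            lengths.append(0)
        elif not ids:
            problems.append(f"line {n}: sequence data before any header")
        else:
            lengths[-1] += len(stripped)

    # Duplicate headers may cause later records to overwrite earlier output.
    for rec_id, count in Counter(ids).items():
        if count > 1:
            problems.append(f"duplicate header {rec_id!r} appears {count} times")
    for rec_id, length in zip(ids, lengths):
        if length == 0:
            problems.append(f"record {rec_id!r} has no sequence")
    return ids, lengths, problems


if __name__ == "__main__":
    example = ">Q8C0M9\nMACARGTVAPP\n>Q6NXK8\nMELKTEEEEVG\n"
    ids, lengths, problems = summarize_fasta(example)
    print(f"{len(ids)} records, longest {max(lengths)} residues")
    for problem in problems:
        print("warning:", problem)
```

For a real file, the text would come from something like `open("your_file.fasta").read()`. If the record count reported here matches the expected ~9,000, the file itself is probably not the problem.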

DimaMolod commented 1 week ago

Hello @poojaparameswaran99, and thanks for your interest in AlphaPulldown! To generate features for multiple sequences, you can either save each sequence in a separate FASTA file and provide a comma-separated list of files, as in --fasta_paths=Q8C0M9.fasta,Q6NXK8.fasta,..., or save all the sequences in a single FASTA file. In your case, I recommend the second option. You can then generate the features with the command sbatch --array=1-9000 run_create_individual_features.sh, using the usual Slurm script with --fasta_paths=<your_file.fasta> and the rest of the flags.
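
To illustrate why the --array range matters: in a Slurm job array, each task receives its own index in the `SLURM_ARRAY_TASK_ID` environment variable, and a script can use that index to pick one sequence out of a multi-record file. The sketch below shows that general pattern only. It is not AlphaPulldown's actual code; `parse_fasta`, `record_for_task`, and `task_id_from_env` are hypothetical helpers, and the one-task-per-sequence mapping is an assumption made for the example.

```python
import os


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:].strip(), ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
    return [(header, seq) for header, seq in records]


def record_for_task(records, task_id):
    """Return the record for a 1-based array task index."""
    if not 1 <= task_id <= len(records):
        raise IndexError(f"task {task_id} is outside 1..{len(records)}")
    return records[task_id - 1]


def task_id_from_env(env=None, default=1):
    """Read the array index Slurm sets for each task (assumed default: 1)."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_ARRAY_TASK_ID", default))


if __name__ == "__main__":
    records = parse_fasta(">accession1\nAASEQ1\n>accession2\nAASEQ2\n")
    # Simulate two array tasks by passing stand-in environments.
    for fake_env in ({"SLURM_ARRAY_TASK_ID": "1"}, {"SLURM_ARRAY_TASK_ID": "2"}):
        header, sequence = record_for_task(records, task_id_from_env(fake_env))
        print(header, len(sequence))
```

With this kind of mapping, an array range of 1-9000 gives one task per record in a 9,000-record file.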

soderling-lab commented 1 week ago

Hi @DimaMolod, thank you for your response! That is how I have it now, in this format:

>accession1
AASEQUENCEASFOLLOWS1
>accession2
AASEQUENCEASFOLLOWS2

But for some reason not all of the accessions are being saved; only a small handful are. In any case, I will try again. I did not specify the --array parameter, so that may be the issue. I will look into it and follow up if I run into complications.

I noticed that you suggest running a shell script (.sh), whereas I am running the .py file directly. Could this be the issue?

Thank you!