ScienceParkStudyGroup / studyGroup

Gather together a group to skill-share, co-work, and create community
https://www.scienceparkstudygroup.info

Getting sequences from chromosomes fasta files with bedtools, WARNING Chr not found! #39

Closed malj390 closed 5 years ago

malj390 commented 5 years ago

Hello,

I am trying to get some introns from the human genome. I am using an intron.bed file that contains:

head intron_ensembl.bed

4 74445407 74446533 AREG-201_intron_1 +
4 74446783 74449046 AREG-201_intron_2 +
4 74449249 74450379 AREG-201_intron_3 +
4 74450533 74452543 AREG-201_intron_4 +
4 74452656 74454758 AREG-201_intron_5 +
4 74445407 74446533 AREG-202_intron_1 +
4 74446783 74449046 AREG-202_intron_2 +
4 74449249 74450379 AREG-202_intron_3 +
4 74450533 74454758 AREG-202_intron_4 +
21 42366595 42363254 TFF1-201_intron_1 -

And an example of the chromosome files:

`head 1.fa

1 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN`

This is the command that I'm running with bedtools:

bedtools getfasta -fi 4.fa -bed intron_ensembl.bed -fo introns_sequences.fasta -fullHeader

And the output error:

Warning: the index file is older than the FASTA file.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
WARNING. chromosome (4) was not found in the FASTA file. Skipping.
Error: malformed BED entry at line 10. Start was greater than end. Exiting.

I already checked the "start was greater than end" message. I specify the strand in a separate column, so I assumed I shouldn't need to modify those coordinates.

Thank you everybody in advance,

Miguel

mgalland commented 5 years ago

Hello Miguel, it looks like your FASTA file does not contain the chromosome name (in this case 4). That is what the warning means when it says that chromosome 4 was not found. Could you try reformatting your 4.fa FASTA file like this:

>4
ATCGATCG...

Try again and let me know what you have.
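A quick way to check for this kind of mismatch is to list the names on both sides and compare them. The snippet below is only a sketch: the two small files it creates are made-up stand-ins for 4.fa and intron_ensembl.bed, the BED file is assumed to be tab-separated, and the temporary directory and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the chromosome names used in a BED file with the FASTA headers.
# Both input files are tiny made-up examples, not the real data from this thread.
tmp=$(mktemp -d)
printf '>4\nACGTACGTAC\n' > "$tmp/4.fa"
printf '4\t2\t6\tdemo_intron_1\t+\n21\t1\t5\tdemo_intron_2\t-\n' > "$tmp/demo.bed"

# Sequence names: header text after ">" up to the first whitespace.
grep '^>' "$tmp/4.fa" | sed 's/^>//' | awk '{ print $1 }' | sort -u > "$tmp/fasta_names.txt"

# Chromosome names used in column 1 of the BED file (tab-separated assumed).
cut -f1 "$tmp/demo.bed" | sort -u > "$tmp/bed_names.txt"

# Names that appear in the BED file but not in the FASTA file.
missing=$(comm -13 "$tmp/fasta_names.txt" "$tmp/bed_names.txt")
echo "Missing from FASTA: $missing"
```

Any name printed as missing would be skipped or rejected by bedtools when run against that single FASTA file. If the "index file is older than the FASTA file" warning also shows up, deleting the matching .fai file should let bedtools generate a fresh one on the next run.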

malj390 commented 5 years ago

I did that, but the ">" symbol disappeared when I posted. It should be >1 in that example. I did it for all of the files, and it still gives the same error.

Another question: can I run the bedtools command on all the FASTA files in the folder?

Thank you!

malj390 commented 5 years ago

OK, it is working, but now I get this error. It is because some of the positions are on the opposite strand, but I'm already specifying the strand in an extra column.

bedtools getfasta -fi 4.fa -bed intron_ensembl.bed -fo introns_sequences.fasta -fullHeader
index file 4.fa.fai not found, generating...
Error: malformed BED entry at line 10. Start was greater than end. Exiting.

mgalland commented 5 years ago

Could you display the first 10 lines of your bed files? You can use the Markdown formatting to display your table in this thread in a readable way. Please check the following page: https://help.github.com/articles/organizing-information-with-tables/

malj390 commented 5 years ago

Fixed! It seems bedtools only reads intervals from the lower coordinate (start) to the higher coordinate (end), regardless of strand. So I swapped the start and end positions for the sequences on the negative strand.
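For anyone hitting the same error, the swap can be done in one pass with awk. This is a minimal sketch, assuming a tab-separated file with start in column 2 and end in column 3; the two rows are copied from the BED excerpt at the top of this thread, and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: swap start and end on every row where start is greater than end.
# Assumes tab-separated columns: chrom, start, end, name, strand.
tmp=$(mktemp -d)
printf '4\t74445407\t74446533\tAREG-201_intron_1\t+\n' > "$tmp/introns.bed"
printf '21\t42366595\t42363254\tTFF1-201_intron_1\t-\n' >> "$tmp/introns.bed"

awk 'BEGIN { FS = OFS = "\t" }
     # Force a numeric comparison, then swap columns 2 and 3 if needed.
     ($2 + 0) > ($3 + 0) { t = $2; $2 = $3; $3 = t }
     { print }' "$tmp/introns.bed" > "$tmp/introns_fixed.bed"

cat "$tmp/introns_fixed.bed"
```

Rows that are already ordered pass through unchanged. Whether the swapped coordinates also need an off-by-one adjustment depends on how the BED file was produced, so it is worth spot-checking one intron against the genome.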

Anyway, I checked that the primary_assembly file contains the same information as the per-chromosome files, but in a single file. I changed the name too and tried to run it, but it doesn't work. I would like to have only one input and only one output with all the sequences.

Thanks in advance!
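One possible route to a single input and a single output, assuming the per-chromosome files already work on their own, is to concatenate them into one FASTA file and run bedtools once against it. The sketch below uses two made-up mini chromosomes and a made-up BED file; all file names are hypothetical, and the bedtools step is skipped when bedtools is not installed.

```shell
#!/usr/bin/env bash
# Sketch: build one combined FASTA from per-chromosome files, then run bedtools once.
# The sequences and intervals are tiny made-up examples.
tmp=$(mktemp -d)
printf '>4\nACGTACGTAC\n' > "$tmp/4.fa"
printf '>21\nTTGGCCAATT\n' > "$tmp/21.fa"
printf '4\t2\t6\tdemo_intron_1\t+\n21\t1\t5\tdemo_intron_2\t-\n' > "$tmp/demo.bed"

# Concatenate the per-chromosome files. The output uses a different extension
# so that a later "*.fa" glob would not pick up the combined file itself.
cat "$tmp/4.fa" "$tmp/21.fa" > "$tmp/all_chromosomes.fasta"

# Sanity check: one header per chromosome should be present.
n_headers=$(grep -c '^>' "$tmp/all_chromosomes.fasta")
echo "Sequences in combined FASTA: $n_headers"

# Single bedtools run, using the same options that appear earlier in this thread.
if command -v bedtools >/dev/null 2>&1; then
    bedtools getfasta -fi "$tmp/all_chromosomes.fasta" -bed "$tmp/demo.bed" \
        -fo "$tmp/introns_sequences.fasta"
fi
```

If the combined file still fails while the per-chromosome files work, comparing the header lines of the two (for example with grep '^>') is a reasonable first check, since the names in the BED file have to match the sequence names in the FASTA file.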