comprna / MoSEA

Motif Scan and Enrichment Analysis (MoSEA)
ISC License

mosea.py scan #7

Open sk1350 opened 4 years ago

sk1350 commented 4 years ago

Hello,

I have been trying to run MoSEA/mosea.py scan on the test files and I get this error:

python MoSEA/mosea.py scan --pfm --pfm_path MoSEA/test_files/motifs/pfms/ --fasta fafile_reg --out_dir fmopfm_outdir --count

scanning Motifs on file: fafile_reg
121/121[==================================================] 100%
Scanned 121 motif(s). Output saved in dir: fmopfm_outdir
('fafile_reg', 'MoSEA/test_files/motifs/pfms/', 'fmopfm_outdir')
ERROR: Counting Motifs on file: fafile_reg
1/38[= ] 2%
Error in parsing: "['sequence name'] not in index"

I understand that the issue must be with parsing the pfm files, as the error comes from the count_motif function, but I don't understand why.

EduEyras commented 4 years ago

Hi,

Thanks for your query. Have you checked that all the required files are correctly defined? I am cc'ing Dr. Singh in case she can provide further insight into this error.

Thanks,
Eduardo


-- Prof. E Eyras EMBL Australia Group Leader The John Curtin School of Medical Research - Australian National University https://github.com/comprna http://scholar.google.com/citations?user=LiojlGoAAAAJ

sk1350 commented 4 years ago

Hello,

Could it potentially be an issue with FIMO not being in my path? Other files are all correctly defined and are all in the same directory so I am unsure what the issue could be.

Thanks, Sofia

EduEyras commented 4 years ago

Hi,

Thanks for checking. Good point, that could be a possibility.

When run with matrices, MoSEA will try to run FIMO on the corresponding set of matrices. Either FIMO or the matrices might not be visible to it. Does it work with k-mers?

E
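A quick way to test the "FIMO not in my path" hypothesis is to ask Python where it would find the executable. This is only a sketch: the helper name `find_executable` is hypothetical and not part of MoSEA, and it assumes FIMO is installed under its usual command name `fimo`.

```python
import shutil


def find_executable(name="fimo"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    # shutil.which searches the directories listed in the PATH variable.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable("fimo")
    if path is None:
        print("fimo not found on PATH; a matrix-based scan could not call it")
    else:
        print("fimo found at", path)
```

If this reports that `fimo` is missing, adding the directory that contains it to PATH before running mosea.py would be the first thing to try.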


sk1350 commented 4 years ago

Hello,

So I checked my fa file and it seems to be in the wrong format.

ENSG00000015475;SE:chr22:18222254-18222848:18222942-18226569:-;down::chr22:18222647-18222847(-)
GTACTCGGGCAGGGGCAGCACGGAGGCTGTGCGCCAGAGGAGGAGGACTGAGGGGCAAGGGGGAGAGCTCTGGTTGGAAAGGCAGGGGAGATTCTCCAGGGCCTTGCCGGTGCCAGTGACAACTGGGGTTTTCCTGAGACGGGACTGCGAGGAATGGGGGCTCTCAGGCTTGAGAGGGCAAAAGTGGGTCTGGGATGCCG

Compared to the example fa file:

ENSG00000015475;SE:chr22:18222254-18222848:18222942-18226569:-;down
GTACTCGGGCAGGGGCAGCACGGAGGCTGTGCGCCAGAGGAGGAGGACTGAGGGGCAAGGGGGAGAGCTCTGGTTGGAAAGGCAGGGGAGATTCTCCAGGGCCTTGCCGGTGCCAGTGACAACTGGGGTTTTCCTGAGACGGGACTGCGAGGAATGGGGGCTCTCAGGCTTGAGAGGGCAAAAGTGGGTCTGGGATGCCG

Which is probably why I now get this error:

Error in parsing: Length of passed values is 1, index implies 2

However, I have been using the example code:

python MoSEA/mosealib/suppa_to_bed.py --ifile MoSEA/test_files/infile/control_events_chr22.ids --event SE --ext 200 --ofile TEST_events_bedfile
python MoSEA/mosea.py getfasta --bedfile TEST_events_bedfile --genome hg19.fa --output TEST_fafile

I would really appreciate some help with this.

Sofia

sk1350 commented 4 years ago

I used the UCSC genome

babisingh commented 4 years ago

Hi,

I am not sure where this extra value, ':chr22:18222647-18222847(-)', is coming from. It looks like the ID for the 200 bases downstream is being concatenated. Could you please cross-check the input file for the step before this one and confirm that it is properly tab-separated?

Thanks, Babita
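As a possible workaround while the cause is unknown, the extra coordinates could be stripped from the FASTA headers after the file is written. The sketch below assumes that the unwanted part always starts at the first `::` (as in the header posted above) and that header lines start with `>`; both are assumptions, and the function names are hypothetical rather than part of MoSEA.

```python
def strip_coordinate_suffix(header):
    """Drop everything from the first '::' onward in a FASTA header."""
    # Assumption: the event ID itself never contains '::'.
    return header.split("::", 1)[0]


def clean_fasta_lines(lines):
    """Yield FASTA lines with the '::...' suffix removed from header lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield strip_coordinate_suffix(line)
        else:
            # Sequence lines are passed through unchanged.
            yield line


if __name__ == "__main__":
    example = [
        ">ENSG00000015475;SE:chr22:18222254-18222848:18222942-18226569:-;down"
        "::chr22:18222647-18222847(-)\n",
        "GTACTCGGGCAGG\n",
    ]
    for cleaned in clean_fasta_lines(example):
        print(cleaned)
```

This only hides the symptom; it does not explain why the getfasta step produced the longer header in the first place.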

sk1350 commented 4 years ago

The bed file (TEST_events_bedfile) looks like this:

chr22 45944292 45944492 ENSG00000077942;SE:chr22:45943084-45944493:45944624-45946372:+;up 0 +
chr22 45944492 45944624 ENSG00000077942;SE:chr22:45943084-45944493:45944624-45946372:+;E 0 +

babisingh commented 4 years ago

Yes, this input file is correct.

sk1350 commented 4 years ago

Do you have any suggestions as to why the following step (python MoSEA/mosea.py getfasta --bedfile TEST_events_bedfile --genome hg19.fa --output TEST_fafile) produces a file of the wrong format?

EduEyras commented 4 years ago

Could it be an issue of spaces vs. tabs? I was wondering whether the program is expecting tabs and you have spaces, or the other way around, so that the IDs get mixed up.

An extra bit added to the ID usually has to do with the input not following the expected format, hidden control characters, etc.

I hope this helps

E
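The space-vs-tab and hidden-character questions can be checked mechanically rather than by eye. The following is a minimal sketch, assuming a six-column BED file like the one posted above; `check_bed_line` is a hypothetical helper, not a MoSEA function.

```python
def check_bed_line(line, expected_fields=6):
    """Return a list of problems found in one BED line (empty if none)."""
    problems = []
    text = line.rstrip("\r\n")
    # Windows-style line endings leave a carriage return before the newline.
    if line.rstrip("\n").endswith("\r"):
        problems.append("carriage return at end of line")
    # A space-separated line collapses into a single field here.
    fields = text.split("\t")
    if len(fields) != expected_fields:
        problems.append(
            "expected %d tab-separated fields, found %d"
            % (expected_fields, len(fields))
        )
    # Tabs are the legitimate separator; anything else must be printable.
    if any(ch != "\t" and not ch.isprintable() for ch in text):
        problems.append("non-printable character present")
    return problems


if __name__ == "__main__":
    with open("TEST_events_bedfile") as handle:
        for number, bed_line in enumerate(handle, start=1):
            for problem in check_bed_line(bed_line):
                print("line %d: %s" % (number, problem))
```

No output means every line has six tab-separated fields and no hidden control characters, which would rule out this explanation.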


MSajek commented 2 years ago

Hi,

About 1.5 years later I encountered the same error. The first cause appears to be the header of the .fmo files: newer versions of FIMO may use different headers in their output tables, so check the headers in your .fmo files carefully. In my case it was 'sequence_name' instead of 'sequence name'. To produce the .fmo files, run mosea.py scan without the --count flag.

To fix the error I changed the header name in line 60 of controller.py. I also had to change line 63 to avoid an error that occurred after fixing the header name. New line 63:

df['count'] = df['event'].groupby(df['event']).transform('count')

I also added a new line after line 67, which changes the datatype from float to int:

df = df.astype({'count':'int'})

The modified version of the _create_motif_count_list function in controller.py is below:

import pandas as pd  # needed only if this is run outside controller.py

def _create_motif_count_list(motif_count_file, df_seq):
    # _check_file is a helper defined elsewhere in controller.py
    _check_file(motif_count_file)

    # list for motif counts (kept from the original; not used below)
    list_fmo_fa_count = []

    # Read the FIMO output (.fmo) table and keep only the sequence-name column
    df = pd.read_csv(motif_count_file, sep="\t", header=0)
    df = df[['sequence_name']]
    df.columns = ['event']

    # Count motif hits per event, then keep one row per event
    df['count'] = df['event'].groupby(df['event']).transform('count')
    df = df.drop_duplicates()

    # Outer-merge with df_seq so that events without hits get a count of 0
    df = pd.merge(df_seq, df, on="event", how='outer')
    df = df.fillna(0)
    df = df.astype({'count': 'int'})

    dflist_count = df['count'].tolist()
    return dflist_count

Hopefully this will be helpful.

Marcin
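Since the column name differs between FIMO versions, a variant that accepts either spelling avoids editing controller.py again after the next FIMO upgrade. This is a sketch only: `read_sequence_names` and `CANDIDATE_COLUMNS` are hypothetical names, and the two spellings are the only ones reported in this thread, so other FIMO versions may need more entries.

```python
import io

import pandas as pd

# Header spellings reported in this thread; other FIMO versions may differ.
CANDIDATE_COLUMNS = ("sequence_name", "sequence name")


def read_sequence_names(fmo_source):
    """Return the sequence-name column of a FIMO table as a Series named 'event'."""
    df = pd.read_csv(fmo_source, sep="\t", header=0)
    for column in CANDIDATE_COLUMNS:
        if column in df.columns:
            return df[column].rename("event")
    # Fail with a message that shows which headers were actually present.
    raise KeyError(
        "none of %r found in columns %r" % (CANDIDATE_COLUMNS, list(df.columns))
    )


# Small in-memory table standing in for a .fmo file (made-up values).
demo = io.StringIO("motif_id\tsequence_name\nM1\tevA\nM1\tevA\nM2\tevB\n")
print(read_sequence_names(demo).value_counts().to_dict())
```

The returned Series could replace the `df[['sequence_name']]` selection in the function above; the per-event counting and the merge with `df_seq` would stay as they are.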